Embedded Cluster Modelling - A novel method for analysing embedded data sets

Citation
A.P. Worth and M.T.D. Cronin, Embedded Cluster Modelling - A novel method for analysing embedded data sets, QSAR, 18(3), 1999, pp. 229-235
Number of citations
10
Subject Categories
Chemistry & Analysis
Journal title
QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS
Journal ISSN
0931-8771
Volume
18
Issue
3
Year of publication
1999
Pages
229 - 235
Database
ISI
SICI code
0931-8771(199907)18:3<229:ECM-AN>2.0.ZU;2-B
Abstract
Cluster Significance Analysis (CSA) is a method for analysing embedded data sets, i.e. data sets in which the objects (chemicals) are divided into two classes (active/inactive or toxic/non-toxic) and in which one class of objects (typically, the active or toxic chemicals) is found to cluster along one or more variables (e.g. physicochemical descriptors), forming an 'embedded cluster' surrounded by the 'diffuse cluster' of objects in the other class (typically, the inactive or non-toxic chemicals). The aim of CSA is to identify variables along which clustering is statistically significant. Having identified significant variables, the investigator may wish to derive a model for classifying active and inactive chemicals on the basis of these variables. In this paper, a method called 'embedded cluster modelling' (ECM) is proposed for the derivation of such classification models. If ECM is applied to a single variable, the resulting model consists of two cut-off values (an upper and a lower limit) between which the active (toxic) chemicals are predicted to lie. If ECM is applied to two or more variables, the resulting model is best described as an 'elliptic model' of cluster membership, since the active (or toxic) chemicals are predicted to lie inside the boundary of a two-dimensional or three-dimensional ellipse, which is regarded as the boundary of the embedded cluster. The combined use of CSA and ECM for the analysis of embedded data sets is illustrated by their application to a data set of methacycline derivatives. The algorithms for CSA and ECM have been coded in the form of Minitab macros, which the authors are making freely available.
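Illustrative sketch
The abstract does not say how ECM derives its cut-off values or the ellipse boundary, so the Python below is only a rough illustration of what an 'elliptic model' of cluster membership could look like, not the authors' algorithm or their Minitab macros. It assumes an axis-aligned ellipse centred on the per-variable mean of the active chemicals, with each semi-axis set to a hypothetical multiple (`scale`) of the per-variable standard deviation. The function names, the `scale` parameter and the fitting rule are all assumptions made for this example.

```python
from statistics import mean, stdev


def fit_axis_aligned_ellipse(actives, scale=2.0):
    """Fit a hypothetical elliptic model to the active chemicals.

    Assumption (not from the paper): centre = per-variable mean of the
    actives, semi-axis = scale * per-variable sample standard deviation.
    Needs at least two actives, and every variable must actually vary.
    """
    columns = list(zip(*actives))  # one tuple of values per variable
    centre = [mean(col) for col in columns]
    semi_axes = [scale * stdev(col) for col in columns]
    return centre, semi_axes


def inside(point, centre, semi_axes):
    """Return True if the point lies on or inside the axis-aligned ellipse."""
    return sum(
        ((x - c) / a) ** 2 for x, c, a in zip(point, centre, semi_axes)
    ) <= 1.0


# Made-up descriptor values for three "active" chemicals (two variables).
actives = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
centre, semi_axes = fit_axis_aligned_ellipse(actives)

print(inside((2.0, 3.0), centre, semi_axes))  # a point at the centre
print(inside((5.0, 3.0), centre, semi_axes))  # a point far along variable 1
```

With a single variable the same test reduces to checking whether a value lies between a lower limit (`centre - semi_axis`) and an upper limit (`centre + semi_axis`), which matches the two cut-off values described in the abstract.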