Embedded Cluster Modelling - A novel method for analysing embedded data sets

Authors

Worth, AP; Cronin, MTD

Citation

A.P. Worth and M.T.D. Cronin, Embedded Cluster Modelling - A novel method for analysing embedded data sets, QSAR, 18(3), 1999, pp. 229-235

Subject categories

Chemistry & Analysis

Journal title

QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS

ISSN journal

0931-8771

Volume

18

Issue

3

Year of publication

1999

Pages

229 - 235

Database

ISI

SICI code

0931-8771(199907)18:3<229:ECM-AN>2.0.ZU;2-B

Abstract

Cluster Significance Analysis (CSA) is a method for analysing embedded data sets, i.e. data sets in which the objects (chemicals) are divided into two classes (active/inactive or toxic/non-toxic) and in which one class of objects (typically, the active or toxic chemicals) is found to cluster along one or more variables (e.g. physicochemical descriptors), forming an 'embedded cluster' surrounded by the 'diffuse cluster' of objects in the other class (typically, the inactive or non-toxic chemicals). The aim of CSA is to identify variables along which clustering is statistically significant. Having identified significant variables, the investigator may wish to derive a model for classifying active and inactive chemicals on the basis of these variables. In this paper, a method called 'embedded cluster modelling' (ECM) is proposed for the derivation of such classification models. If ECM is applied to a single variable, the resulting model consists of two cut-off values (an upper and a lower limit) between which the active (toxic) chemicals are predicted to lie. If ECM is applied to two or more variables, the resulting model is best described as an 'elliptic model' of cluster membership, since the active (or toxic) chemicals are predicted to lie inside the boundary of a two-dimensional or three-dimensional ellipse, which is regarded as the boundary of the embedded cluster. The combined use of CSA and ECM for the analysis of embedded data sets is illustrated by their application to a data set of methacycline derivatives. The algorithms for CSA and ECM have been coded in the form of Minitab macros, which the authors are making freely available.
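
The two model shapes described in the abstract (a pair of cut-off values for one variable, an ellipse for two variables) can be illustrated with a minimal Python sketch. This is not the authors' Minitab implementation, and the abstract does not say how the cut-offs or the ellipse are fitted. The fitting rules used here are assumptions made only for illustration: the cut-offs are taken as the minimum and maximum of the active chemicals, and the ellipse is axis-aligned, centred on the mean of the actives, with semi-axes equal to a hypothetical `scale` factor times the standard deviation along each variable. All function names are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple

Point = Tuple[float, float]
Ellipse = Tuple[float, float, float, float]  # (centre_x, centre_y, semi_axis_x, semi_axis_y)


def fit_interval(active_values: Sequence[float]) -> Tuple[float, float]:
    """One variable: return (lower, upper) cut-offs.

    Assumption: the cut-offs are simply the range of the active chemicals.
    """
    return min(active_values), max(active_values)


def predict_interval(x: float, lower: float, upper: float) -> bool:
    """Predict 'active' when x lies between the two cut-off values."""
    return lower <= x <= upper


def fit_ellipse(active_points: Sequence[Point], scale: float = 2.0) -> Ellipse:
    """Two variables: return an axis-aligned ellipse around the actives.

    Assumption: centre = mean of the actives, semi-axes = scale * population
    standard deviation along each variable.
    """
    n = len(active_points)
    cx = sum(p[0] for p in active_points) / n
    cy = sum(p[1] for p in active_points) / n
    sx = sqrt(sum((p[0] - cx) ** 2 for p in active_points) / n)
    sy = sqrt(sum((p[1] - cy) ** 2 for p in active_points) / n)
    if sx == 0.0 or sy == 0.0:
        raise ValueError("actives show no spread along one variable")
    return cx, cy, scale * sx, scale * sy


def predict_ellipse(point: Point, model: Ellipse) -> bool:
    """Predict 'active' when the point lies inside or on the ellipse boundary."""
    cx, cy, a, b = model
    return ((point[0] - cx) / a) ** 2 + ((point[1] - cy) / b) ** 2 <= 1.0


# Made-up descriptor values for three active chemicals (illustrative only).
actives = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
lower, upper = fit_interval([p[0] for p in actives])
model = fit_ellipse(actives)
```

With these rules, a chemical is classified as active only if its descriptor values fall inside the fitted region; extending the same inequality with a third squared term would give the three-variable case.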