Cluster Significance Analysis (CSA) is a method for analysing embedded data
sets, i.e. data sets in which the objects (chemicals) are divided into two
classes (active/inactive or toxic/non-toxic) and in which one class of
objects (typically, the active or toxic chemicals) is found to cluster along
one or more variables (e.g. physicochemical descriptors), forming an
'embedded cluster' surrounded by the 'diffuse cluster' of objects in the
other class (typically, the inactive or non-toxic chemicals). The aim of CSA
is to identify variables along which clustering is statistically significant.
Having identified significant variables, the investigator may wish to derive a
model for classifying active and inactive chemicals on the basis of these
variables. In this paper, a method called 'embedded cluster modelling' (ECM)
is proposed for the derivation of such classification models. If ECM is
applied to a single variable, the resulting model consists of two cut-off
values (an upper and a lower limit) between which the active (toxic) chemicals
are predicted to lie. If ECM is applied to two or more variables, the
resulting model is best described as an 'elliptic model' of cluster membership,
since the active (or toxic) chemicals are predicted to lie inside the
boundary of a two-dimensional ellipse or a three-dimensional ellipsoid, which
is regarded as the boundary of the embedded cluster. The combined use of CSA
and ECM for
the analysis of embedded data sets is illustrated by their application to
a data set of methacycline derivatives. The algorithms for CSA and ECM have
been coded in the form of Minitab macros, which the authors are making
freely available.
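
The two ideas summarised above can be illustrated with a minimal Python sketch. It is not a port of the authors' Minitab macros, and every function name in it is hypothetical. The significance test assumes the commonly described resampling form of CSA: the mean squared distance among the active chemicals is compared with that of randomly drawn subsets of the same size. The membership test assumes that the ellipse is specified by a centre, a covariance matrix and a radius supplied by the caller; how ECM actually derives the boundary is not stated here, so that step is left out. Variables are assumed to be already scaled to comparable units.

```python
import numpy as np


def mean_squared_distance(x):
    """Mean squared pairwise distance among the rows of x (n_objects, n_vars)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    diffs = x[:, None, :] - x[None, :, :]
    sq = (diffs ** 2).sum(axis=-1)
    # The diagonal is zero, so dividing by n*(n-1) averages over distinct pairs.
    return sq.sum() / (n * (n - 1))


def csa_p_value(x, active, n_resamples=2000, seed=0):
    """Estimate how often a random subset is at least as tight as the actives.

    Assumption: resampling form of CSA; names and defaults are illustrative.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a single variable as one column
    active = np.asarray(active, dtype=bool)
    rng = np.random.default_rng(seed)
    k = int(active.sum())
    observed = mean_squared_distance(x[active])
    hits = 0
    for _ in range(n_resamples):
        idx = rng.choice(x.shape[0], size=k, replace=False)
        if mean_squared_distance(x[idx]) <= observed:
            hits += 1
    return hits / n_resamples


def in_elliptic_model(x, centre, cov, radius=2.0):
    """True for rows of x (n_objects, n_vars) lying inside the ellipse.

    Assumption: the boundary is a Mahalanobis-distance contour; with one
    variable this reduces to a lower and an upper cut-off value.
    """
    x = np.asarray(x, dtype=float)
    d = x - np.asarray(centre, dtype=float)
    inv = np.linalg.inv(np.atleast_2d(np.asarray(cov, dtype=float)))
    m2 = np.einsum("ij,jk,ik->i", d, inv, d)
    return m2 <= radius ** 2
```

A small p-value from `csa_p_value` would suggest that the actives cluster along the chosen variables more tightly than chance alone would explain. One possible choice for `centre` and `cov`, again an assumption rather than the published procedure, is the mean and covariance of the active chemicals on the significant variables.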