L. Eriksson et al., On the selection of the training set in environmental QSAR analysis when compounds are clustered, J CHEMOMETR, 14(5-6), 2000, pp. 599-616
In QSAR analysis in environmental sciences, adverse effects of chemicals
released to the environment are modelled and predicted as a function of the
chemical properties of the pollutants. Usually the set of compounds under
study contains several classes of substances, i.e. a more or less strongly
clustered set. It is then necessary to ensure that the selected training
set comprises compounds representing all those chemical classes.
Multivariate design in the principal properties of the compound classes is
usually appropriate for selecting a meaningful training set. However, with
clustered data, often seen in environmental chemistry and toxicology, a
single multivariate design may be suboptimal because of the risk of
ignoring small classes with few members and only selecting training set
compounds from the largest classes. We recently proposed a procedure for
training set selection that recognizes clustering. In this approach, when
non-selective biological or environmental responses are modelled, local
multivariate designs are constructed within each cluster (class). The
compounds chosen by the local designs are finally united in the overall
training set, which thus will contain members from all clusters. The
proposed strategy is here further tested and elaborated by applying it to a
series of 351 chemical substances for which the soil sorption coefficient
is available. These compounds are divided into 14 classes containing
between 10 and 52 members. The training set selection is discussed,
followed by multivariate QSAR modelling, model interpretation and
predictions for the test set. Various types of statistical experimental
designs are tested during the training set selection phase.
Copyright (C) 2000 John Wiley & Sons, Ltd.
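
The cluster-wise selection described in the abstract can be illustrated
with a minimal sketch. It is not the authors' implementation: the function
name local_design_selection, the use of a local PCA computed by SVD, the
choice of a two-level full factorial design with a centre point, and the
synthetic descriptor data are all assumptions made for illustration. Within
each cluster, the compounds lying nearest to the design points in the local
score space are picked, and the picks from all clusters are united.

```python
import itertools
import numpy as np


def local_design_selection(X, labels, n_components=2):
    """Pick training compounds cluster by cluster (illustrative sketch).

    X: (n_compounds, n_descriptors) array; labels: cluster label per row.
    Returns the sorted row indices of the selected compounds.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    selected = set()
    for cluster in np.unique(labels):
        idx = np.flatnonzero(labels == cluster)
        Xc = X[idx] - X[idx].mean(axis=0)  # centre within the cluster
        # Local PCA via SVD: score vectors are U * S
        U, S, _ = np.linalg.svd(Xc, full_matrices=False)
        k = min(n_components, U.shape[1])
        scores = U[:, :k] * S[:k]
        # Scale each score column to [-1, 1] so design corners sit at +/-1
        span = np.abs(scores).max(axis=0)
        span[span == 0] = 1.0
        scaled = scores / span
        # Assumed design: two-level full factorial corners plus centre point
        targets = [np.array(c, dtype=float)
                   for c in itertools.product((-1, 1), repeat=k)]
        targets.append(np.zeros(k))
        for t in targets:
            nearest = np.argmin(((scaled - t) ** 2).sum(axis=1))
            selected.add(int(idx[nearest]))
    return sorted(selected)


# Synthetic stand-in data: three clusters of unequal size, four descriptors
rng = np.random.default_rng(0)
sizes = (10, 25, 52)
X = np.vstack([rng.normal(loc=5 * i, size=(n, 4))
               for i, n in enumerate(sizes)])
labels = np.repeat(np.arange(len(sizes)), sizes)
train = local_design_selection(X, labels)
```

Because the design is built separately inside each cluster, even the
smallest class contributes compounds to the united training set, which is
the property a single global design does not guarantee.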