On the selection of the training set in environmental QSAR analysis when compounds are clustered

Citation
L. Eriksson et al., On the selection of the training set in environmental QSAR analysis when compounds are clustered, J CHEMOMETR, 14(5-6), 2000, pp. 599-616
Number of citations
28
Subject categories
Spectroscopy/Instrumentation/Analytical Sciences
Journal title
JOURNAL OF CHEMOMETRICS
ISSN journal
0886-9383
Volume
14
Issue
5-6
Year of publication
2000
Pages
599 - 616
Database
ISI
SICI code
0886-9383(200009/12)14:5-6<599:OTSOTT>2.0.ZU;2-Y
Abstract
In QSAR analysis in environmental sciences, adverse effects of chemicals released to the environment are modelled and predicted as a function of the chemical properties of the pollutants. Usually the set of compounds under study contains several classes of substances, i.e. a more or less strongly clustered set. It is then necessary to ensure that the selected training set comprises compounds representing all those chemical classes. Multivariate design in the principal properties of the compound classes is usually appropriate for selecting a meaningful training set. However, with clustered data, often seen in environmental chemistry and toxicology, a single multivariate design may be suboptimal because of the risk of ignoring small classes with few members and only selecting training set compounds from the largest classes. Recently a procedure for training set selection recognizing clustering was proposed by us. In this approach, when non-selective biological or environmental responses are modelled, local multivariate designs are constructed within each cluster (class). The chosen compounds arising from the local designs are finally united in the overall training set, which thus will contain members from all clusters. The proposed strategy is here further tested and elaborated by applying it to a series of 351 chemical substances for which the soil sorption coefficient is available. These compounds are divided into 14 classes containing between 10 and 52 members. The training set selection is discussed, followed by multivariate QSAR modelling, model interpretation and predictions for the test set. Various types of statistical experimental designs are tested during the training set selection phase. Copyright (C) 2000 John Wiley & Sons, Ltd.
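Illustrative sketch (not part of the original record)
The cluster-wise selection described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not say which designs were used, so the sketch assumes a two-level full factorial design plus a centre point in the first local principal component scores, and it picks the compound nearest to each design point. The function name `local_design_selection` and its parameters are hypothetical.

```python
import numpy as np
from itertools import product


def local_design_selection(X, labels, n_components=2):
    """Pick training compounds cluster by cluster and unite them.

    X      : (n_compounds, n_descriptors) array of chemical descriptors
    labels : (n_compounds,) array of class/cluster labels
    Returns a sorted list of row indices forming the overall training set.

    Assumption: a 2^k factorial design plus centre point in local PCA
    score space stands in for the "local multivariate design".
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    selected = set()

    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        Xc = X[idx]

        # Autoscale within the cluster; guard against constant descriptors.
        std = Xc.std(axis=0)
        std[std == 0] = 1.0
        Z = (Xc - Xc.mean(axis=0)) / std

        # Local PCA via SVD; the scores are U * S.
        U, S, _ = np.linalg.svd(Z, full_matrices=False)
        k = min(n_components, len(S))
        T = U[:, :k] * S[:k]

        # Design points: factorial corners scaled to the score range,
        # plus the centre of the cluster.
        half_range = np.abs(T).max(axis=0)
        corners = np.array(list(product([-1.0, 1.0], repeat=k))) * half_range
        targets = np.vstack([corners, np.zeros(k)])

        # For each design point, take the nearest real compound.
        for t in targets:
            nearest = np.argmin(((T - t) ** 2).sum(axis=1))
            selected.add(int(idx[nearest]))

    return sorted(selected)
```

Because the design is built separately inside every class, each class contributes at least one compound (at most five with two components), so small classes are not crowded out by large ones, which is the risk the abstract attributes to a single global design.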