In QSAR analysis in the environmental sciences, adverse effects of
chemicals released to the environment are modelled and predicted as a function
of the chemical properties of the pollutants. Usually, the set of
compounds under study contains several classes of substances, i.e., it is a
more or less strongly clustered set. It is then necessary to ensure that the
selected training set comprises compounds representing all those
chemical classes. Multivariate design in the principal properties of the
compound classes is usually appropriate for selecting a meaningful
training set. However, with clustered data, often seen in environmental
chemistry and toxicology, a single multivariate design may be suboptimal
because it risks ignoring small classes with few members
and only selecting training set compounds from the largest classes. In
this paper, a procedure for training set selection that accounts for
clustering is proposed. Here, when non-selective biological or environmental
responses are modelled, local multivariate designs are constructed
within each cluster (class). The chosen compounds arising from the local
designs are finally combined into the overall training set, which will
thus contain members from all clusters. Our illustration deals with a
set of 66 compounds, categorized into five classes, for which the soil
sorption coefficient is available. The training set selection is
discussed, followed by multivariate QSAR modelling, model validation and
interpretation, and predictions for the test set.
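
The cluster-wise selection described above can be illustrated with a
minimal sketch. It is not the authors' implementation, and several
details are assumptions made only for illustration: descriptors are
autoscaled within each cluster, principal component scores are computed
with a small NIPALS routine, each local design is a two-level factorial
(corners plus a centre point) in the first two score vectors, and the
compound nearest to each design point is chosen. The function names
(autoscale, pca_scores, local_design, select_training_set) are
hypothetical, as is the rule that clusters with one or two members are
kept whole.

```python
import math
from itertools import product


def autoscale(rows):
    """Centre each descriptor column and scale it to unit variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    # A constant column has zero spread; divide by 1.0 so it stays at zero.
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def pca_scores(X, n_components=2, n_iter=500, tol=1e-12):
    """Principal component scores by NIPALS; X must already be autoscaled."""
    X = [row[:] for row in X]
    n, p = len(X), len(X[0])
    columns = []
    for _ in range(min(n_components, p)):
        # Start from the column with the largest sum of squares.
        j0 = max(range(p), key=lambda j: sum(r[j] ** 2 for r in X))
        t = [r[j0] for r in X]
        load = [0.0] * p
        for _ in range(n_iter):
            tt = sum(v * v for v in t)
            if tt < 1e-12:  # no variation left to extract
                break
            load = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
            norm = math.sqrt(sum(v * v for v in load))
            load = [v / norm for v in load]
            t_new = [sum(X[i][j] * load[j] for j in range(p)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t, t_new))
            t = t_new
            if change < tol:
                break
        columns.append(t)
        # Deflate: remove the component just extracted.
        X = [[X[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
    return [list(row) for row in zip(*columns)]


def local_design(scores):
    """Pick members closest to the corners and centre of a factorial design."""
    k = len(scores[0])
    spans = [max(abs(s[c]) for s in scores) for c in range(k)]
    targets = [[sign * spans[c] for c, sign in enumerate(signs)]
               for signs in product((-1.0, 1.0), repeat=k)]
    targets.append([0.0] * k)  # centre point
    chosen = []
    for target in targets:
        free = [i for i in range(len(scores)) if i not in chosen]
        if not free:
            break
        chosen.append(min(free, key=lambda i: math.dist(scores[i], target)))
    return chosen


def select_training_set(descriptors, labels, n_components=2):
    """Union of local designs, one per cluster; returns sorted row indices."""
    selected = []
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        if len(members) <= 2:  # too small for a design: keep every member
            selected.extend(members)
            continue
        block = autoscale([descriptors[i] for i in members])
        scores = pca_scores(block, n_components)
        selected.extend(members[i] for i in local_design(scores))
    return sorted(selected)
```

Calling select_training_set(descriptors, labels) with one descriptor
row and one class label per compound returns the indices of the
proposed training compounds; the remaining indices form the test set.
Because each class is designed separately, a small class contributes
members even when a much larger class dominates the data set.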