
EXTENDING THE TREND VECTOR - THE TREND MATRIX AND SAMPLE-BASED PARTIAL LEAST-SQUARES

Authors

SHERIDAN RP; NACHBAR RB; BUSH BL

Citation

R.P. Sheridan et al., EXTENDING THE TREND VECTOR - THE TREND MATRIX AND SAMPLE-BASED PARTIAL LEAST-SQUARES, Journal of computer-aided molecular design, 8(3), 1994, pp. 323-340

Citations number

Subject categories

Biology

Journal title

Journal of computer-aided molecular design

ISSN journal

0920654X

Volume

8

Issue

3

Year of publication

1994

Pages

323 - 340

Database

ISI

SICI code

0920-654X(1994)8:3<323:ETTV-T>2.0.ZU;2-Y

Abstract

Trend vector analysis [Carhart, R.E. et al., J. Chem. Inf. Comput. Sci., 25 (1985) 64], in combination with topological descriptors such as atom pairs, has proved useful in drug discovery for ranking large collections of chemical compounds in order of predicted biological activity. The compounds with the highest predicted activities, upon being tested, often show a several-fold increase in the fraction of active compounds relative to a randomly selected set. A trend vector is simply the one-dimensional array of correlations between the biological activity of interest and a set of properties or 'descriptors' of compounds in a training set. This paper examines two methods for generalizing the trend vector to improve the predicted rank order. The trend matrix method finds the correlations between the residuals and the simultaneous occurrence of descriptors, which are stored in a two-dimensional analog of the trend vector. The SAMPLS method derives a linear model by partial least squares (PLS), using the 'sample-based' formulation of PLS [Bush, B.L. and Nachbar, R.B., J. Comput.-Aided Mol. Design, 7 (1993) 587] for efficiency in treating the large number of descriptors. PLS accumulates a predictive model as a sum of linear components. Expressed as a vector of prediction coefficients on properties, the first PLS component is proportional to the trend vector. Subsequent components adjust the model toward full least squares. For both methods the residuals decrease, while the risk of overfitting the training set increases. We therefore also describe statistical checks to prevent overfitting. These methods are applied to two data sets: a small homologous series of disubstituted piperidines, tested on the dopamine receptor, and a large set of diverse chemical structures, some of which are active at the muscarinic receptor. Each data set is split into a training set and a test set, and the activities in the test set are predicted from a fit on the training set. Both the trend matrix and the SAMPLS approach improve the predictions over the simple trend vector. The SAMPLS approach is superior to the trend matrix in that it requires much less storage and CPU time. It also provides a useful set of axes for visualizing properties of the compounds. We describe a randomization method to determine the optimum number of PLS components that is very much faster for large training sets than leave-one-out cross-validation.
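
The abstract's definition of a trend vector, and its use for ranking untested compounds, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the "correlation" is the unnormalized, mean-centred cross-product of each descriptor column with the activity (the quantity to which the first PLS weight vector is proportional), and that a compound is scored by the dot product of its descriptors with the trend vector. The function names and the toy data are hypothetical.

```python
def trend_vector(descriptors, activities):
    """Return one value per descriptor: the mean-centred cross-product
    of that descriptor column with the activity (assumed definition)."""
    n = len(activities)
    mean_act = sum(activities) / n
    n_desc = len(descriptors[0])
    col_means = [sum(row[j] for row in descriptors) / n for j in range(n_desc)]
    return [
        sum((row[j] - col_means[j]) * (act - mean_act)
            for row, act in zip(descriptors, activities))
        for j in range(n_desc)
    ]


def score(compound, trend):
    """Predicted-activity score: dot product of descriptors and trend vector."""
    return sum(d * t for d, t in zip(compound, trend))


def rank_by_score(compounds, trend):
    """Indices of compounds, highest predicted score first."""
    return sorted(range(len(compounds)),
                  key=lambda i: score(compounds[i], trend),
                  reverse=True)


# Hypothetical toy data: rows are compounds, columns are descriptor counts
# (for example, occurrences of three atom-pair types).
train_X = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
train_y = [2.0, 1.0, 0.0, 1.0]

trend = trend_vector(train_X, train_y)

test_X = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
ranking = rank_by_score(test_X, trend)
```

In this toy case the first and third descriptors co-occur with higher activity and the second with lower activity, so test compounds carrying the first and third descriptors rank ahead of the one carrying only the second.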