SAMPLE-DISTANCE PARTIAL LEAST-SQUARES - PLS OPTIMIZED FOR MANY VARIABLES, WITH APPLICATION TO COMFA

Citation
B.L. Bush and R.B. Nachbar, SAMPLE-DISTANCE PARTIAL LEAST-SQUARES - PLS OPTIMIZED FOR MANY VARIABLES, WITH APPLICATION TO COMFA, Journal of computer-aided molecular design, 7(5), 1993, pp. 587-619
Number of citations
36
Subject Categories
Biology
Journal ISSN
0920-654X
Volume
7
Issue
5
Year of publication
1993
Pages
587 - 619
Database
ISI
SICI code
0920-654X(1993)7:5<587:SPL-PO>2.0.ZU;2-R
Abstract
Three-dimensional molecular modeling can provide an unlimited number m of structural properties. Comparative Molecular Field Analysis (CoMFA), for example, may calculate thousands of field values for each model structure. When m is large, partial least squares (PLS) is the statistical method of choice for fitting and predicting biological responses. Yet PLS is usually implemented in a property-based fashion which is optimal only for small m. We describe here a sample-based formulation of PLS which can be used to fit any single response (bioactivity). SAMPLS reduces all explanatory data to the pairwise 'distances' among n samples (molecules), or equivalently to an n-by-n covariance matrix C. This matrix, unmodified, can be used to fit all PLS components. Furthermore, SAMPLS will validate the model by modern resampling techniques, at a cost independent of m. We have implemented SAMPLS as a Fortran program and have reproduced conventional and cross-validated PLS analyses of data from two published studies. Full (leave-each-out) cross-validation of a typical CoMFA takes 0.2 CPU s. SAMPLS is thus ideally suited to structure-activity analysis based on CoMFA fields or bonded topology. The sample-distance formulation also relates PLS to methods like cluster analysis and nonlinear mapping, and shows how drastically PLS simplifies the information in CoMFA fields.
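
The idea of fitting every PLS component from the unmodified n-by-n matrix C can be illustrated with a minimal Python sketch. It is not the authors' Fortran program. It assumes NumPy, a single response y, and a matrix C formed from the column-centred property table; the function names sample_covariance and sampls_fit are hypothetical and chosen only for this illustration. Each latent variable is taken as C times the current response residual and then orthogonalised against the earlier latent variables, so the m properties are touched only once, when C is built.

```python
import numpy as np


def sample_covariance(X):
    """Return the n-by-n matrix C = Xc Xc^T for an n-by-m property table X."""
    Xc = X - X.mean(axis=0)  # centre each property (column)
    return Xc @ Xc.T


def sampls_fit(C, y, n_components):
    """Fitted responses of a single-response PLS model built from C and y only.

    Illustrative sketch: C is never deflated or otherwise modified.
    """
    y = np.asarray(y, dtype=float)
    y_mean = y.mean()
    resid = y - y_mean                # current response residual
    fitted = np.full_like(y, y_mean)  # start from the mean response
    scores = []                       # latent variables found so far
    for _ in range(n_components):
        t = C @ resid                 # candidate latent variable
        t = t - t.mean()
        for s in scores:              # orthogonalise against earlier ones
            t = t - (s @ t) / (s @ s) * s
        tt = t @ t
        if tt == 0.0:                 # nothing left to explain
            break
        beta = (t @ resid) / tt       # regress the residual on t
        fitted += beta * t
        resid -= beta * t
        scores.append(t)
    return fitted
```

In this sketch the cost of each component depends on n alone. The number of properties m enters only in sample_covariance, which is consistent with the abstract's point that validation cost can be made independent of m.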