B.L. Bush and R.B. Nachbar, Sample-distance partial least-squares - PLS optimized for many variables, with application to CoMFA, Journal of Computer-Aided Molecular Design, 7(5), 1993, pp. 587-619
Three-dimensional molecular modeling can provide an unlimited number m
of structural properties. Comparative Molecular Field Analysis (CoMFA),
for example, may calculate thousands of field values for each model
structure. When m is large, partial least squares (PLS) is the
statistical method of choice for fitting and predicting biological
responses. Yet PLS is usually implemented in a property-based fashion
which is optimal only for small m. We describe here a sample-based
formulation of PLS which can be used to fit any single response
(bioactivity). SAMPLS reduces all explanatory data to the pairwise
'distances' among n samples (molecules), or equivalently to an n-by-n
covariance matrix C.
This matrix, unmodified, can be used to fit all PLS components.
Furthermore, SAMPLS will validate the model by modern resampling
techniques, at a cost independent of m. We have implemented SAMPLS as a
Fortran program and have reproduced conventional and cross-validated
PLS analyses of data from two published studies. Full (leave-each-out)
cross-validation of a typical CoMFA takes 0.2 CPU s. SAMPLS is thus
ideally suited to structure-activity analysis based on CoMFA fields or
bonded topology. The sample-distance formulation also relates PLS to
methods like cluster analysis and nonlinear mapping, and shows how
drastically PLS simplifies the information in CoMFA fields.
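
The idea of fitting every PLS component from an unmodified n-by-n matrix C can be illustrated with a short sketch. This is not the authors' Fortran program: it is a minimal Python/NumPy reading of a sample-based single-response PLS fit, and the function names (covariance_from_fields, covariance_from_distances, sample_pls_fit) are hypothetical. It assumes C is built from column-centered explanatory data, that 'distances' means squared Euclidean distances, and that each new score vector is made orthogonal to the earlier ones by Gram-Schmidt rather than by deflating C.

```python
import numpy as np


def covariance_from_fields(X):
    """Build the n-by-n sample matrix C = Xc Xc^T from an n-by-m data table.

    This is the only step whose cost depends on m (the number of properties).
    """
    Xc = X - X.mean(axis=0)  # center each property (column)
    return Xc @ Xc.T


def covariance_from_distances(D2):
    """Recover the same C from squared pairwise Euclidean distances.

    Uses the classical double-centering identity C = -1/2 * J * D2 * J,
    with J = I - (1/n) * ones.
    """
    n = D2.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    return -0.5 * J @ D2 @ J


def sample_pls_fit(C, y, n_components):
    """Fit one response y using only the n-by-n matrix C (assumed sketch).

    Returns the fitted responses and the matrix of score vectors (n-by-A).
    C itself is never modified; orthogonality is imposed on the scores.
    """
    y = np.asarray(y, dtype=float)
    y_mean = y.mean()
    resid = y - y_mean                 # centered response residual
    fitted = np.full_like(y, y_mean)
    scores = []
    for _ in range(n_components):
        t = C @ resid                  # candidate score from current residual
        for s in scores:               # Gram-Schmidt against earlier scores
            t = t - (s @ t) / (s @ s) * s
        norm2 = t @ t
        if norm2 < 1e-20:              # nothing left to extract
            break
        beta = (t @ resid) / norm2     # regress residual on the new score
        fitted = fitted + beta * t
        resid = resid - beta * t
        scores.append(t)
    return fitted, np.array(scores).T


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 2000))    # 10 samples, 2000 made-up field values
    y = rng.normal(size=10)            # made-up responses
    C = covariance_from_fields(X)
    for a in (1, 2, 3):
        fitted, _ = sample_pls_fit(C, y, a)
        print(a, float(np.sum((y - fitted) ** 2)))
```

Once C is formed, each component costs on the order of n-squared operations regardless of m, which is the property the abstract emphasizes. Cross-validation is left out of this sketch; it would need C to be re-centered on each training subset before refitting.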