QSAR with few compounds and many features

Citation
D.M. Hawkins et al., QSAR with few compounds and many features, J. Chem. Inf. Comput. Sci., 41(3), 2001, pp. 663-670
Number of citations
30
Subject categories
Chemistry
Journal title
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES
ISSN journal
0095-2338
Volume
41
Issue
3
Year of publication
2001
Pages
663 - 670
Database
ISI
SICI code
0095-2338(200105/06)41:3<663:QWFCAM>2.0.ZU;2-9
Abstract
Fitting quantitative structure-activity relationships (QSAR) requires different statistical methodologies and, to some degree, philosophies depending on the "shape" of the data matrix. When few features are used and there are many compounds, it is a reasonable expectation that good feature subset selection may be made and that nonlinearities and nonadditivities can be detected and diagnosed. Where there are many features and few compounds, this is unrealistic. Methods such as ridge regression (RR), PLS, and principal component regression (PCR), which abjure feature selection and rely on linearity, may provide good predictions and fair understanding. We report a development of ridge regression for the underdetermined case by using generalized cross-validation to choose the ridge constant and perform F-tests for additional information. Conventional regression diagnostics can be used in follow-up to identify nonlinearities and other departures from the model. We illustrate the approach with QSAR models of four data sets using calculated molecular descriptors.
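To make the central idea of the abstract concrete, the following is a minimal sketch of ridge regression in the underdetermined case (more features than compounds), with the ridge constant chosen by generalized cross-validation (GCV). It is not the authors' implementation: the function name ridge_gcv, the grid of candidate ridge constants, the mean-centering of the descriptors, the counting of one degree of freedom for the intercept, and the synthetic data are all assumptions made for illustration. The F-tests for additional information mentioned in the abstract are not covered.

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    """Ridge regression with the ridge constant picked by GCV (illustrative sketch).

    X: (n compounds, p descriptors), y: (n,) activities, lambdas: positive candidates.
    Works when p > n because everything goes through the thin SVD of centered X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Center so the intercept is handled separately (an assumption of this sketch).
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Thin SVD: Xc = U diag(d) Vt; drop numerically zero singular values.
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = d > 1e-10 * d.max()
    U, d, Vt = U[:, keep], d[keep], Vt[keep]
    Uty = U.T @ yc

    best = None
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)           # eigenvalues of the hat matrix
        fitted = U @ (shrink * Uty)
        rss = np.sum((yc - fitted) ** 2)
        df = 1.0 + shrink.sum()                # trace of hat matrix, +1 for intercept
        gcv = (rss / n) / (1.0 - df / n) ** 2  # generalized cross-validation score
        if best is None or gcv < best[0]:
            coef = Vt.T @ (d / (d**2 + lam) * Uty)
            best = (gcv, lam, coef)

    gcv, lam, coef = best
    intercept = y_mean - x_mean @ coef
    return lam, coef, intercept, gcv

# Hypothetical data: 20 compounds, 100 descriptors, 5 of which matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 100))
y_demo = X_demo[:, :5].sum(axis=1) + 0.1 * rng.normal(size=20)
lam, coef, intercept, gcv = ridge_gcv(X_demo, y_demo, np.logspace(-3, 3, 25))
print(f"chosen ridge constant: {lam:.4g}, GCV score: {gcv:.4g}")
```

Working through the singular value decomposition keeps the cost tied to the number of compounds rather than the number of descriptors, and it makes the effective degrees of freedom available for every candidate ridge constant without refitting.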