VARIABLE SELECTION IN QSAR STUDIES .1. AN EVOLUTIONARY ALGORITHM

Authors
Citation
H. Kubinyi, VARIABLE SELECTION IN QSAR STUDIES .1. AN EVOLUTIONARY ALGORITHM, Quantitative structure-activity relationships, 13(3), 1994, pp. 285-294
Number of citations
33
Subject Categories
Pharmacology & Pharmacy
ISSN journal
0931-8771
Volume
13
Issue
3
Year of publication
1994
Pages
285 - 294
Database
ISI
SICI code
0931-8771(1994)13:3<285:VSIQS.>2.0.ZU;2-O
Abstract
In QSAR studies of large data sets, variable selection and model building is a difficult, time-consuming and ambiguous procedure. While most often stepwise regression procedures are applied for this purpose, other strategies, like neural networks, cluster significance analysis or genetic algorithms have been used. A simple and efficient evolutionary strategy, including iterative mutation and selection, but avoiding crossover of regression models, is described in this work. The MUSEUM (Mutation and Selection Uncover Models) algorithm starts from a model containing any number of randomly chosen variables. Random mutation, first by addition or elimination of only one or very few variables, afterwards by simultaneous random additions, eliminations and/or exchanges of several variables at a time, leads to new models which are evaluated by an appropriate fitness function. In contrast to common genetic algorithm procedures, only the "fittest" model is stored and used for further mutation and selection, leading to better and better models. In the last steps of mutation, all variables inside the model are eliminated and all variables outside the model are added, one by one, to control whether this systematic strategy detects any mutation which still improves the model. After every generation of a better model, a new random mutation procedure starts from this model. In the very last step, variables not significant at the 95% level are eliminated, starting with the least significant variable. In this manner, "stable" models are produced, containing only significant variables. A comparison of the results for the Selwood data set (n = 31 compounds, k = 53 variables) with those obtained by other groups shows that more relevant models are derived by the evolutionary approach than by other methods.
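
The mutation-and-selection loop described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. The abstract does not name the fitness function, so adjusted R² of an ordinary least-squares fit is assumed here. The function names (`solve`, `fitness`, `mutate`, `museum`) and the parameters `n_random` and `max_changes` are hypothetical. The final elimination of variables not significant at the 95% level is left out, since it would require a t-test. The response values `y` are assumed not to be constant.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return None  # singular system, e.g. collinear variables
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * z for x, z in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fitness(X, y, model):
    """Adjusted R^2 of an OLS fit with intercept (assumed fitness function)."""
    n, k = len(y), len(model)
    if n - k - 1 <= 0:
        return float("-inf")
    rows = [[1.0] + [X[i][j] for j in sorted(model)] for i in range(n)]
    p = k + 1
    xtx = [[sum(r[u] * r[v] for r in rows) for v in range(p)] for u in range(p)]
    xty = [sum(r[u] * y[i] for i, r in enumerate(rows)) for u in range(p)]
    beta = solve(xtx, xty)
    if beta is None:
        return float("-inf")
    mean = sum(y) / n
    ss_tot = sum((v - mean) ** 2 for v in y)
    ss_res = sum((y[i] - sum(bc * x for bc, x in zip(beta, rows[i]))) ** 2
                 for i in range(n))
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

def mutate(model, n_vars, n_changes, rng):
    """Flip membership of n_changes random variables (add or eliminate)."""
    flips = rng.sample(range(n_vars), n_changes)
    return frozenset(model) ^ frozenset(flips)

def museum(X, y, n_random=200, max_changes=3, seed=0):
    """Keep only the fittest model; restart after every improvement."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    start_size = rng.randint(1, min(3, n_vars))
    best = frozenset(rng.sample(range(n_vars), start_size))
    best_fit = fitness(X, y, best)
    improved = True
    while improved:
        improved = False
        # Random phase: single changes first, several at a time later.
        for t in range(n_random):
            n_changes = min(1 + (t * max_changes) // n_random, n_vars)
            cand = mutate(best, n_vars, n_changes, rng)
            f = fitness(X, y, cand)
            if f > best_fit:
                best, best_fit, improved = cand, f, True
                break  # a better model restarts the random phase
        if improved:
            continue
        # Systematic phase: eliminate each inside variable and add each
        # outside variable, one by one.
        for j in range(n_vars):
            cand = best ^ {j}
            f = fitness(X, y, cand)
            if f > best_fit:
                best, best_fit, improved = cand, f, True
                break
    return sorted(best), best_fit
```

Calling `museum(X, y)` with `X` as a list of rows (one per compound) and `y` as the list of activities returns the selected column indices and their fitness. Because the fitness strictly increases at each accepted step and the number of possible models is finite, the loop terminates at a model that no single addition or elimination improves.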