VARIABLE SELECTION IN QSAR STUDIES .1. AN EVOLUTIONARY ALGORITHM

Authors

KUBINYI H

Citation

H. Kubinyi, VARIABLE SELECTION IN QSAR STUDIES .1. AN EVOLUTIONARY ALGORITHM, Quantitative structure-activity relationships, 13(3), 1994, pp. 285-294

Subject categories

Pharmacology & Pharmacy

Journal title

Quantitative structure-activity relationships

ISSN journal

0931-8771

Volume

13

Issue

3

Year of publication

1994

Pages

285 - 294

Database

ISI

SICI code

0931-8771(1994)13:3<285:VSIQS.>2.0.ZU;2-O

Abstract

In QSAR studies of large data sets, variable selection and model building is a difficult, time-consuming and ambiguous procedure. While most often stepwise regression procedures are applied for this purpose, other strategies, like neural networks, cluster significance analysis or genetic algorithms, have been used. A simple and efficient evolutionary strategy, including iterative mutation and selection, but avoiding crossover of regression models, is described in this work. The MUSEUM (Mutation and Selection Uncover Models) algorithm starts from a model containing any number of randomly chosen variables. Random mutation, first by addition or elimination of only one or very few variables, afterwards by simultaneous random additions, eliminations and/or exchanges of several variables at a time, leads to new models which are evaluated by an appropriate fitness function. In contrast to common genetic algorithm procedures, only the "fittest" model is stored and used for further mutation and selection, leading to better and better models. In the last steps of mutation, all variables inside the model are eliminated and all variables outside the model are added, one by one, to check whether this systematic strategy detects any mutation which still improves the model. After every generation of a better model, a new random mutation procedure starts from this model. In the very last step, variables not significant at the 95% level are eliminated, starting with the least significant variable. In this manner, "stable" models are produced, containing only significant variables. A comparison of the results for the Selwood data set (n = 31 compounds, k = 53 variables) with those obtained by other groups shows that more relevant models are derived by the evolutionary approach than by other methods.
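The mutation-and-selection loop outlined in the abstract can be illustrated with a minimal Python sketch. It is not the published MUSEUM implementation: the function names (`find_better`, `museum_sketch`, `toy_fitness`), the trial counts, and the toy fitness function are all assumptions made for illustration. The abstract speaks only of "an appropriate fitness function", so the sketch takes the fitness as a caller-supplied callable, and it omits the final elimination of variables not significant at the 95% level, which would require regression statistics.

```python
import random


def find_better(model, best_fit, n_vars, fitness, rng, n_random, max_changes):
    """Return (model, fitness) for the first improving mutation, or None."""
    # Random phase: single-variable changes first, then several at a time.
    # Toggling a variable adds it if absent and removes it if present, so
    # toggling two variables at once can also act as an exchange.
    for trial in range(n_random):
        k = 1 if trial < n_random // 2 else rng.randint(2, max_changes)
        flips = rng.sample(range(n_vars), k)
        candidate = model.symmetric_difference(flips)
        f = fitness(candidate)
        if f > best_fit:
            return candidate, f
    # Systematic phase: remove each included variable and add each excluded
    # one, one by one, to catch any single change that still improves.
    for v in range(n_vars):
        candidate = model.symmetric_difference({v})
        f = fitness(candidate)
        if f > best_fit:
            return candidate, f
    return None


def museum_sketch(n_vars, fitness, n_random=200, max_changes=4, seed=0):
    rng = random.Random(seed)
    # Start from a model with any number of randomly chosen variables.
    best = frozenset(v for v in range(n_vars) if rng.random() < 0.5)
    best_fit = fitness(best)
    while True:
        # Only the fittest model is kept (no population, no crossover);
        # every improvement restarts the random mutation procedure from it.
        step = find_better(best, best_fit, n_vars, fitness, rng,
                           n_random, max_changes)
        if step is None:
            return best, best_fit
        best, best_fit = step


# Hypothetical toy fitness: rewards three "relevant" variables and
# penalises every extra one. A real QSAR fitness would score a regression.
RELEVANT = frozenset({2, 5, 7})


def toy_fitness(model):
    return len(model & RELEVANT) - 0.1 * len(model - RELEVANT)


best_model, best_fitness = museum_sketch(n_vars=10, fitness=toy_fitness)
print(sorted(best_model), best_fitness)
```

Because the loop accepts only strict improvements over a finite set of variable subsets, it always terminates. With the toy fitness every non-optimal model has an improving single-variable change, so the systematic phase guarantees that the search ends at the three relevant variables; a real regression-based fitness has no such guarantee and may stop at a local optimum.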