In QSAR studies of large data sets, variable selection and model
building are difficult, time-consuming and ambiguous procedures. While
stepwise regression procedures are most often applied for this purpose,
other strategies, like neural networks, cluster significance analysis or
genetic algorithms, have also been used. A simple and efficient
evolutionary strategy, including iterative mutation and selection but
avoiding crossover of regression models, is described in this work. The
MUSEUM (Mutation and Selection Uncover Models) algorithm starts from a
model containing any number of randomly chosen variables. Random
mutation, first by addition or elimination of only one or very few
variables, afterwards by simultaneous random additions, eliminations
and/or exchanges of several variables at a time, leads to new models
which are evaluated by an appropriate fitness function. In contrast to
common genetic
algorithm procedures, only the "fittest" model is stored and used for
further mutation and selection, leading to progressively better models.
In the last steps of mutation, all variables inside the model are
eliminated and all variables outside the model are added, one by one, to
check whether this systematic strategy detects any mutation which still
improves the model. After every generation of a better model, a new
random mutation procedure starts from this model. In the very last
step, variables not significant at the 95% level are eliminated,
starting with the least significant variable. In this manner, "stable"
models are produced, containing only significant variables. A comparison
of the results for the Selwood data set (n = 31 compounds, k = 53
variables) with those obtained by other groups shows that more relevant
models are derived by the evolutionary approach than by other methods.
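
The mutation-and-selection loop described above can be illustrated with a
minimal Python sketch. It is not the authors' implementation, and several
details are assumptions made only for illustration: adjusted R^2 stands in
for the unspecified fitness function, t_crit = 2.0 is a rough stand-in for
the two-sided 95% critical t value (the exact value depends on the degrees
of freedom), and all function and parameter names (museum_sketch, n_start,
n_random, max_flips, prune_insignificant) are hypothetical.

```python
import numpy as np

def fit_ols(X, y, subset):
    """Least-squares fit with intercept; returns design matrix, coefficients, RSS."""
    A = np.column_stack([np.ones(len(y)), X[:, sorted(subset)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return A, beta, float(resid @ resid)

def fitness(X, y, subset):
    """Adjusted R^2: an ASSUMED stand-in for the abstract's fitness function."""
    n, p = len(y), len(subset)
    if p == 0 or p >= n - 1:
        return -np.inf
    _, _, rss = fit_ols(X, y, subset)
    tss = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

def mutate(subset, k, rng, n_changes):
    """Flip membership of n_changes random variables (addition, elimination,
    or, when one goes in and one goes out, an exchange)."""
    flips = rng.choice(k, size=n_changes, replace=False)
    return frozenset(subset.symmetric_difference(int(j) for j in flips))

def systematic_step(X, y, best, best_fit):
    """Eliminate each inside variable and add each outside variable, one by one."""
    for j in range(X.shape[1]):
        cand = best ^ frozenset([j])
        f = fitness(X, y, cand)
        if f > best_fit:
            return cand, f
    return None

def museum_sketch(X, y, n_start=3, n_random=200, max_flips=3, seed=0):
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    best = frozenset(int(j) for j in rng.choice(k, size=n_start, replace=False))
    best_fit = fitness(X, y, best)
    while True:
        # Random mutation phase: single changes first, several at a time later.
        for trial in range(n_random):
            n_changes = 1 + (trial * max_flips) // n_random
            cand = mutate(best, k, rng, n_changes)
            f = fitness(X, y, cand)
            if f > best_fit:
                best, best_fit = cand, f   # only the fittest model is kept
                break                      # restart mutation from the better model
        else:
            # No random mutation helped: fall back to the systematic check.
            improved = systematic_step(X, y, best, best_fit)
            if improved is None:
                return best, best_fit
            best, best_fit = improved

def prune_insignificant(X, y, subset, t_crit=2.0):
    """Drop the least significant variable until every |t| >= t_crit.
    t_crit = 2.0 only approximates the 95% level (assumption)."""
    subset = set(subset)
    while subset:
        cols = sorted(subset)
        A, beta, rss = fit_ols(X, y, subset)
        dof = len(y) - A.shape[1]
        cov = rss / dof * np.linalg.inv(A.T @ A)
        t = np.abs(beta[1:]) / np.sqrt(np.diag(cov)[1:])
        worst = int(np.argmin(t))
        if t[worst] >= t_crit:
            break
        subset.remove(cols[worst])
    return frozenset(subset)
```

A typical call would be `best, score = museum_sketch(X, y)` followed by
`prune_insignificant(X, y, best)`, where X is an n-by-k descriptor matrix
and y the activity vector. Because a candidate replaces the stored model
only when its fitness is strictly higher, the loop terminates once neither
random nor systematic mutation yields an improvement.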