Several predictive models of aqueous solubility have been published. They
perform well on the data sets used to train them, but these data sets
usually contain few structures similar to those of interest in drug
research, so their applicability in drug hunting is questionable. A very
diverse data set has been gathered from compounds reported in the
literature and from proprietary compounds.
These compounds have been grouped into a so-called literature data set I, a
proprietary data set II, and a mixed data set III formed by I and II. About
100 descriptors emphasizing surface properties were calculated for every
compound. Bayesian learning of neural nets, which retains the advantages of
neural nets while avoiding their weaknesses, was used to select the most
parsimonious models and to train them on I, II, and III. The models were
established either by selecting the most efficient descriptors one by one using
a modified Gram-Schmidt procedure (GS) or by simplifying a most complete
model using an automatic relevance determination procedure (ARD). The
predictive ability of the models was assessed using validation data sets as
unrelated to the training sets as possible, using two new parameters:
NDDx,ref, the normalized
smallest descriptor distance of a compound x to a reference data set, and
CDx,mod, the combination of NDDx,ref with the dispersion of the Bayesian
neural net calculations. The results show that it is possible to obtain a
generic predictive model from database I, but that the diversity of database II
is too restricted to give a model with good generalization ability and that
the ARD method applied to the mixed database III gives the best predictive
model.
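
The one-by-one descriptor selection mentioned above can be illustrated with a minimal sketch of forward selection by Gram-Schmidt orthogonalization. It assumes the usual scheme (pick the descriptor most correlated with the target, then project it out of the target and of the remaining descriptors); the specific modification used in this work is not described here and is not reproduced. The function name `gram_schmidt_rank` and its inputs are hypothetical, and the ranking criterion is linear, not the Bayesian neural net itself.

```python
import numpy as np


def gram_schmidt_rank(X, y, n_select):
    """Rank descriptors by forward selection with Gram-Schmidt orthogonalization.

    X: (n_compounds, n_descriptors) descriptor matrix (hypothetical input).
    y: (n_compounds,) target values, e.g. measured log-solubility.
    Returns the indices of the selected descriptors, in selection order.
    """
    # Center both so that the squared cosine is a squared correlation.
    R = np.asarray(X, dtype=float)
    R = R - R.mean(axis=0)
    r = np.asarray(y, dtype=float)
    r = r - r.mean()

    remaining = list(range(R.shape[1]))
    selected = []
    for _ in range(min(n_select, len(remaining))):
        r_norm = np.linalg.norm(r)
        if r_norm < 1e-12:
            break  # the target is already fully explained
        cols = R[:, remaining]
        norms = np.linalg.norm(cols, axis=0)
        usable = norms > 1e-12  # skip descriptors with no independent part left
        if not usable.any():
            break
        # Squared cosine between each remaining descriptor and the residual target.
        cos2 = np.full(len(remaining), -1.0)
        cos2[usable] = (cols[:, usable].T @ r) ** 2 / (norms[usable] ** 2 * r_norm ** 2)
        best = remaining.pop(int(np.argmax(cos2)))
        selected.append(best)
        # Project the chosen direction out of the target and the other descriptors.
        q = R[:, best] / np.linalg.norm(R[:, best])
        r = r - (q @ r) * q
        if remaining:
            R[:, remaining] -= np.outer(q, q @ R[:, remaining])
    return selected
```

Calling `gram_schmidt_rank(X, y, 10)` on a descriptor matrix and a solubility vector would return up to ten column indices, most informative first; in a workflow like the one described, such a ranking would only propose candidate descriptors for the neural net models.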