Search for predictive generic model of aqueous solubility using Bayesian neural nets

Authors
P. Bruneau
Citation
P. Bruneau, Search for predictive generic model of aqueous solubility using Bayesian neural nets, J CHEM INF, 41(6), 2001, pp. 1605-1616
Number of citations
59
Subject categories
Chemistry
Journal title
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES
Journal ISSN
0095-2338
Volume
41
Issue
6
Year of publication
2001
Pages
1605 - 1616
Database
ISI
SICI code
0095-2338(200111/12)41:6<1605:SFPGMO>2.0.ZU;2-H
Abstract
Several predictive models of aqueous solubility have been published. They perform well on the data sets used to train them, but these data sets usually contain few structures similar to those of interest in drug research, so their applicability in drug hunting is questionable. A very diverse data set has been gathered from compounds taken from literature reports and from proprietary compounds. These compounds have been grouped into a so-called literature data set I, a proprietary data set II, and a mixed data set III formed by I and II. About 100 descriptors emphasizing surface properties were calculated for every compound. Bayesian learning of neural nets, which combines the advantages of neural nets without their weaknesses, was used to select the most parsimonious models and train them from I, II, and III. The models were established either by selecting the most efficient descriptors one by one using a modified Gram-Schmidt procedure (GS) or by simplifying the most complete model using automatic relevance determination (ARD). The predictive ability of the models was assessed using validation data sets as unrelated to the training sets as possible, using two new parameters: NDDx,ref, the normalized smallest descriptor distance of a compound x to a reference data set, and CDx,mod, the combination of NDDx,ref with the dispersion of the Bayesian neural net calculations. The results show that it is possible to obtain a generic predictive model from database I, that the diversity of database II is too restricted to give a model with good generalization ability, and that the ARD method applied to the mixed database III gives the best predictive model.
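The one-by-one descriptor selection mentioned in the abstract can be illustrated with a plain Gram-Schmidt forward ranking. This is only a minimal sketch of the textbook procedure, not the modified variant used in the paper, whose details the abstract does not give. The function name `gram_schmidt_rank`, its arguments, and the numerical tolerance are assumptions made for illustration.

```python
import numpy as np


def gram_schmidt_rank(X, y, n_select):
    """Rank descriptor columns of X against target y (hypothetical helper).

    At each step, pick the descriptor with the largest squared cosine to the
    current target residual, then project that direction out of the target
    and out of every descriptor column before the next step.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center the data so squared cosines equal squared correlations.
    X = X - X.mean(axis=0)
    y = y - y.mean()

    tol = 1e-12  # assumed tolerance for treating a vector as zero
    remaining = list(range(X.shape[1]))
    selected = []

    for _ in range(min(n_select, len(remaining))):
        y_norm = np.linalg.norm(y)
        norms = np.linalg.norm(X[:, remaining], axis=0)
        usable = norms > tol
        if y_norm <= tol or not usable.any():
            break  # nothing left to explain, or no independent descriptor left

        # Squared cosine between each remaining descriptor and the residual.
        dots = X[:, remaining].T @ y
        cos2 = np.zeros(len(remaining))
        cos2[usable] = (dots[usable] / (norms[usable] * y_norm)) ** 2

        best = remaining[int(np.argmax(cos2))]
        selected.append(best)

        # Orthogonalize the target and all descriptors against the winner.
        q = X[:, best] / np.linalg.norm(X[:, best])
        y = y - (q @ y) * q
        X = X - np.outer(q, q @ X)
        remaining.remove(best)

    return selected
```

Calling `gram_schmidt_rank(X, y, 10)` on a descriptor matrix with one row per compound would return the column indices of up to ten descriptors in order of selection. In the workflow the abstract describes, such a ranking would feed candidate descriptor subsets to the Bayesian neural net; that step is not shown here.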