Prediction of n-octanol/water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices

Citation
I.V. Tetko et al., Prediction of n-octanol/water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices, J CHEM INF, 41(5), 2001, pp. 1407-1421
Number of citations
53
Subject categories
Chemistry
Journal title
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES
Journal ISSN
0095-2338
Volume
41
Issue
5
Year of publication
2001
Pages
1407 - 1421
Database
ISI
SICI code
0095-2338(200109/10)41:5<1407:PONPCF>2.0.ZU;2-O
Abstract
A new method, ALOGPS v 2.0 (http://www.lnh.unil.ch/~itetko/logp/), for the assessment of the n-octanol/water partition coefficient, log P, was developed on the basis of neural network ensemble analysis of 12 908 organic compounds available from the PHYSPROP database of Syracuse Research Corporation. The atom and bond-type E-state indices as well as the number of hydrogen and non-hydrogen atoms were used to represent the molecular structures. A preliminary selection of indices was performed by multiple linear regression analysis, and 75 input parameters were chosen. Some of the parameters combined several atom-type or bond-type indices with similar physicochemical properties. The neural network ensemble training was performed by the efficient partition algorithm developed by the authors. The ensemble contained 50 neural networks, and each neural network had 10 neurons in one hidden layer. The prediction ability of the developed approach was estimated using both the leave-one-out (LOO) technique and a training/test protocol. In the case of interseries predictions, i.e., when molecules in the test and in the training subsets were selected by chance from the same set of compounds, both approaches provided similar results. ALOGPS performance was significantly better than the results obtained by other tested methods. For a subset of 12 777 molecules the LOO results, namely correlation coefficient r^2 = 0.95, root mean squared error RMSE = 0.39, and absolute mean error MAE = 0.29, were calculated. For two cross-series predictions, i.e., when molecules in the training and in the test sets belong to different series of compounds, all analyzed methods performed less efficiently. The decrease in performance could be explained by a different diversity of molecules in the training and in the test sets. However, even for such difficult cases the ALOGPS method provided better prediction ability than the other tested methods.
We have shown that the diversity of the training sets rather than the design of the methods is the main factor determining their prediction ability for new data. A comparative performance of the methods as well as a dependence on the number of non-hydrogen atoms in a molecule is also presented.
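Illustrative note
The abstract reports three statistics (r^2, RMSE, MAE) and describes an ensemble of 50 networks. The following is a minimal Python sketch of how such statistics could be computed from paired experimental and predicted log P values. It is not the authors' code. The function names are hypothetical, r^2 is assumed here to be the squared Pearson correlation (the abstract does not give its exact definition), and combining ensemble members by a simple mean is likewise an assumption.

```python
import math


def regression_metrics(observed, predicted):
    """Return (r2, rmse, mae) for paired observed/predicted log P values.

    Assumption: r2 is the squared Pearson correlation coefficient.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Covariance and variance sums (unnormalized; the factors of n cancel in r2).
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant input")
    r2 = cov ** 2 / (var_obs * var_pred)

    # Error-based statistics on the residuals (predicted minus observed).
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return r2, rmse, mae


def ensemble_prediction(member_predictions):
    """Combine the outputs of several models for one molecule by a simple mean.

    Assumption: plain averaging; the abstract does not state the combination rule.
    """
    if not member_predictions:
        raise ValueError("need at least one member prediction")
    return sum(member_predictions) / len(member_predictions)


if __name__ == "__main__":
    # Made-up values for illustration only, not data from the paper.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.0, 2.5, 4.0]
    r2, rmse, mae = regression_metrics(observed, predicted)
    print(f"r^2 = {r2:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

In a leave-one-out setting, each predicted value would come from a model that was not trained on that molecule. The statistics themselves are computed the same way as for a held-out test set.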