Are grammatical representations useful for learning from biological sequence data? A case study

Citation
Sh. Muggleton et al., Are grammatical representations useful for learning from biological sequence data? A case study, J COMPUT BI, 8(5), 2001, pp. 493-521
Number of citations
40
Subject Categories
Biochemistry & Biophysics
Journal title
JOURNAL OF COMPUTATIONAL BIOLOGY
ISSN journal
1066-5277
Volume
8
Issue
5
Year of publication
2001
Pages
493 - 521
Database
ISI
SICI code
1066-5277(2001)8:5<493:AGRUFL>2.0.ZU;2-R
Abstract
This paper investigates whether Chomsky-like grammar representations are useful for learning cost-effective, comprehensible predictors of members of biological sequence families. The Inductive Logic Programming (ILP) Bayesian approach to learning from positive examples is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). Collectively, five of the co-authors of this paper have extensive expertise on NPPs and general bioinformatics methods. Their motivation for generating an NPP grammar was that none of the existing bioinformatics methods could provide sufficient cost savings during the search for new NPPs. Prior to this project, experienced specialists at SmithKline Beecham had tried for many months to hand-code such a grammar, but without success. Our best predictor makes the search for novel NPPs more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the ILP Bayesian approach to learning from positive examples. A group of features is derived from this grammar. Other groups of features of NPPs are derived using other learning strategies. Amalgams of these groups are formed. A recognition model is generated for each amalgam using C4.5 and C4.5rules, and its performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features.
Predictive accuracy is not a good measure of performance for this domain because it does not discriminate well between NPP recognition models: despite covering varying numbers of (the rare) positives, all the models are awarded a similar (high) score by predictive accuracy because they all exclude most of the abundant negatives.
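The point about predictive accuracy can be illustrated with a small sketch. The counts below (20 positives among 10,000 proteins, and three models covering 2, 10 and 18 positives) are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch does not implement Relative Advantage, whose definition is not given in this abstract. It only shows that accuracy barely moves while coverage of the rare positives changes a great deal.

```python
# Hypothetical data set: positives are rare, negatives are abundant.
# These counts are illustrative assumptions, not figures from the paper.
N_POS = 20
N_NEG = 9980


def evaluate(true_pos, false_pos):
    """Return (accuracy, positive coverage) for a model that accepts
    `true_pos` of the positives and `false_pos` of the negatives."""
    false_neg = N_POS - true_pos
    true_neg = N_NEG - false_pos
    accuracy = (true_pos + true_neg) / (N_POS + N_NEG)
    coverage = true_pos / (true_pos + false_neg)
    return accuracy, coverage


# Three hypothetical recognition models: each wrongly accepts the same
# 10 negatives but covers a very different number of positives.
models = {"covers_2": (2, 10), "covers_10": (10, 10), "covers_18": (18, 10)}
results = {name: evaluate(tp, fp) for name, (tp, fp) in models.items()}

for name, (accuracy, coverage) in results.items():
    print(f"{name}: accuracy={accuracy:.4f}, positive coverage={coverage:.2f}")
```

Under these assumed counts all three models score above 99.7% accuracy, although one finds a tenth of the positives and another finds nine tenths of them.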