Sh. Muggleton et al., Are grammatical representations useful for learning from biological sequence data? A case study, Journal of Computational Biology, 8(5), 2001, pp. 493-521
This paper investigates whether Chomsky-like grammar representations are
useful for learning cost-effective, comprehensible predictors of members of
biological sequence families. The Inductive Logic Programming (ILP) Bayesian
approach to learning from positive examples is used to generate a grammar
for recognising a class of proteins known as human neuropeptide precursors
(NPPs). Collectively, five of the co-authors of this paper have extensive
expertise on NPPs and general bioinformatics methods. Their motivation for
generating an NPP grammar was that none of the existing bioinformatics
methods could provide sufficient cost savings during the search for new
NPPs. Prior to this project, experienced specialists at SmithKline Beecham
had tried for many months to hand-code such a grammar, but without success.
Our best
predictor makes the search for novel NPPs more than 100 times more
efficient than randomly selecting proteins for synthesis and testing them
for biological activity. As far as these authors are aware, this is both
the first biological grammar learnt using ILP and the first real-world
scientific application of the ILP Bayesian approach to learning from
positive examples. A
group of features is derived from this grammar. Other groups of features of
NPPs are derived using other learning strategies. Amalgams of these groups
are formed. A recognition model is generated for each amalgam using C4.5
and C4.5rules, and its performance is measured using both predictive accuracy
and a new cost function, Relative Advantage (RA). The highest RA was
achieved by a model which includes grammar-derived features. This RA is
significantly higher than the best RA achieved without the use of the
grammar-derived features. Predictive accuracy is not a good measure of
performance for this domain because it does not discriminate well between
NPP recognition models: despite covering varying numbers of (the rare)
positives, all the models are awarded a similar (high) score by predictive
accuracy because they all exclude most of the abundant negatives.
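
The point about predictive accuracy can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the class counts (10 positives among 10,000 proteins), the three hypothetical models, and the helper names are not taken from the paper, and the sketch does not implement Relative Advantage, whose definition is not given in this abstract. It only shows that models covering very different numbers of rare positives can receive nearly identical accuracy scores.

```python
# Illustrative sketch only: all counts and model names are hypothetical.
# Positives are rare and negatives abundant, as in the NPP domain.
N_POS, N_NEG = 10, 9990


def confusion(pos_covered, neg_covered):
    """Return (tp, fp, fn, tn) for a model covering the given counts."""
    return (pos_covered, neg_covered, N_POS - pos_covered, N_NEG - neg_covered)


def accuracy(tp, fp, fn, tn):
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fp, fn, tn):
    """Fraction of the positives that the model covers."""
    return tp / (tp + fn)


# Three hypothetical models that exclude almost all negatives but
# cover very different numbers of positives.
MODELS = {
    "covers_no_positives": confusion(0, 0),
    "covers_half": confusion(5, 20),
    "covers_most": confusion(9, 20),
}


def summarise(models=MODELS):
    """Map each model name to its (accuracy, recall) pair."""
    return {name: (accuracy(*c), recall(*c)) for name, c in models.items()}


if __name__ == "__main__":
    for name, (acc, rec) in summarise().items():
        print(f"{name:20s} accuracy={acc:.4f} positives_covered={rec:.2f}")
```

Under these assumed counts, every model scores above 0.99 on accuracy even though the fraction of positives covered ranges from none to nine in ten, which is the kind of insensitivity that motivates a cost-based measure.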