The utility of different representations of protein sequence for predicting functional class

Citation
Rd. King et al., The utility of different representations of protein sequence for predicting functional class, BIOINFORMAT, 17(5), 2001, pp. 445-454
Number of citations
36
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
17
Issue
5
Year of publication
2001
Pages
445 - 454
Database
ISI
SICI code
1367-4803(200105)17:5<445:TUODRO>2.0.ZU;2-1
Abstract
Motivation: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model.
Results: With every type of representation, DMP learnt prediction rules that were more accurate than default at every level of function. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function; 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%.
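The trade-off between accuracy and coverage reported in the abstract can be illustrated with a small sketch. This is not the authors' method: the abstract does not say how predictions were combined, so the rule used here (take the prediction from whichever representation has the highest estimated accuracy, optionally discarding anything below a threshold) is an assumption. The names `combine`, `accuracy_and_coverage`, `toy` and `truth`, the class labels, and all numbers in the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical format: for one ORF, each representation either fires no rule
# (None) or yields (predicted_class, estimated_accuracy_of_that_rule).
Prediction = Optional[Tuple[str, float]]


def combine(per_representation: List[Prediction],
            min_accuracy: float = 0.0) -> Optional[str]:
    """Assumed combination rule: keep the most confident prediction.

    Returns None when no rule fired or none reaches min_accuracy.
    """
    fired = [p for p in per_representation
             if p is not None and p[1] >= min_accuracy]
    if not fired:
        return None
    return max(fired, key=lambda p: p[1])[0]


def accuracy_and_coverage(predicted: Dict[str, Optional[str]],
                          truth: Dict[str, str]) -> Tuple[float, float]:
    """Accuracy = correct / predictions made; coverage = predictions made / all ORFs."""
    made = {orf: cls for orf, cls in predicted.items() if cls is not None}
    coverage = len(made) / len(predicted) if predicted else 0.0
    correct = sum(1 for orf, cls in made.items() if truth.get(orf) == cls)
    accuracy = correct / len(made) if made else 0.0
    return accuracy, coverage


# Toy held-out set (invented values). Each list holds one entry per
# representation: residue frequencies, phylogeny, predicted structure.
toy: Dict[str, List[Prediction]] = {
    "orf1": [("metabolism", 0.75), None, ("transport", 0.60)],
    "orf2": [None, None, None],
    "orf3": [None, ("transport", 0.55), None],
    "orf4": [("metabolism", 0.65), ("transport", 0.68), None],
}
truth = {"orf1": "metabolism", "orf2": "transport",
         "orf3": "transport", "orf4": "metabolism"}

for threshold in (0.0, 0.7):
    predicted = {orf: combine(preds, threshold) for orf, preds in toy.items()}
    acc, cov = accuracy_and_coverage(predicted, truth)
    print(f"threshold={threshold:.1f} accuracy={acc:.2f} coverage={cov:.2f}")
```

Raising the threshold in this toy data leaves fewer ORFs with a prediction but makes the remaining predictions more reliable, which is the same direction of trade-off as the 40%/60% versus 5%/86% figures quoted above.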