Motivation: Data Mining Prediction (DMP) is a novel approach to predicting
protein functional class from sequence. DMP works even in the absence of a
homologous protein of known function. We investigate the utility of
different ways of representing protein sequence in DMP (residue
frequencies, phylogeny, predicted structure) using the Escherichia coli
genome as a model.
Results: With every type of representation, DMP learnt prediction rules
that were more accurate than the default at every level of function. The
most effective way to represent sequence was using phylogeny (75%
accuracy and 13% coverage of unassigned ORFs at the most
general level of function; 69% accuracy and 7% coverage at the most
detailed). We tested different methods for combining predictions from the
different types of representation. These improved both the accuracy and
coverage of
predictions: e.g. 40% of all unassigned ORFs could be predicted at an
estimated accuracy of 60%, and 5% of unassigned ORFs could be predicted at
an estimated accuracy of 86%.
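
The accuracy and coverage figures above trade off against each other, and a
minimal sketch may make the two quantities concrete. It assumes that coverage
is the fraction of ORFs for which any prediction is made and that accuracy is
the fraction of those predictions that are correct. It is evaluated here on a
small labelled hold-out set, whereas for genuinely unassigned ORFs the accuracy
can only be estimated. The function name and the class labels are hypothetical
and are not taken from DMP.

```python
from typing import Optional, Sequence, Tuple


def accuracy_and_coverage(
    predicted: Sequence[Optional[str]],
    actual: Sequence[str],
) -> Tuple[float, float]:
    """Return (accuracy, coverage) for a set of rule-based predictions.

    `predicted[i]` is the predicted functional class for ORF i, or None
    when no rule fires. `actual[i]` is the known class (hold-out label).
    These definitions are assumptions made for illustration.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # Keep only the ORFs for which a prediction was actually made.
    made = [(p, a) for p, a in zip(predicted, actual) if p is not None]

    coverage = len(made) / len(actual) if actual else 0.0
    correct = sum(1 for p, a in made if p == a)
    accuracy = correct / len(made) if made else 0.0
    return accuracy, coverage


# Hypothetical hold-out set of five ORFs; rules fire for three of them.
example_predicted = ["metabolism", None, "transport", None, "metabolism"]
example_actual = ["metabolism", "transport", "regulation", "metabolism", "metabolism"]
example_accuracy, example_coverage = accuracy_and_coverage(
    example_predicted, example_actual
)
```

In this toy case two of the three predictions are correct and three of the
five ORFs receive a prediction. Restricting predictions to the most reliable
rules would typically raise accuracy while lowering coverage, which is the
pattern in the 60%/40% and 86%/5% figures quoted above.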