I.S. Mian and I. Dubchak, Representing and reasoning about protein families using generative and discriminative methods, J. Comput. Biol., 7(6), 2000, pp. 849-862
This work addresses the issues of data representation and incorporation of
domain knowledge into the design of learning systems for reasoning about
protein families. Given the limited expressive capacity of a particular
method, a mixture of protein annotation and fold recognition experts, each
implementing a different underlying representation, should provide a robust
method for assigning sequences to families. These ideas are illustrated using
two data-driven learning methods that make use of different prior information
and employ independent, yet complementary, projections of a family: hidden
Markov models (HMMs) based on a multiple sequence alignment and neural
networks (NNs) based on global sequence descriptors of proteins. Examination
of seven protein families indicates that combining a generative (HMM) and a
discriminative (NN) method is better than either method on its own.
Biologically, human 4-hydroxyphenylpyruvic acid dioxygenase, involved in
tyrosinemia type 3, is predicted to be structurally and functionally related
to the glyoxalase I family.
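
The idea of pairing a generative and a discriminative expert can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the abstract does not say how the two experts are combined or which global descriptors are used, so the weighted average and the amino-acid composition vector below are stand-ins, and the names `composition` and `combine_experts` are hypothetical.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in the sequence.

    Assumption: composition stands in for a "global sequence descriptor";
    the abstract does not list the descriptors actually used.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def combine_experts(p_hmm: float, p_nn: float, w_hmm: float = 0.5) -> float:
    """Weighted average of two family-membership probabilities.

    Assumption: a fixed-weight average is one simple way to mix experts;
    it is not claimed to be the combination rule used in the paper.
    """
    if not 0.0 <= w_hmm <= 1.0:
        raise ValueError("w_hmm must lie in [0, 1]")
    return w_hmm * p_hmm + (1.0 - w_hmm) * p_nn


if __name__ == "__main__":
    # Hypothetical scores for one query sequence and one family.
    features = composition("MKVLAAGIVGLA")
    print(len(features), combine_experts(p_hmm=0.9, p_nn=0.5))
```

In this sketch the composition vector would feed the discriminative expert, while the generative expert scores the sequence against an alignment-derived model. Averaging lets one expert compensate when the other's representation misses a family signal.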