A.H. Liu and A. Califano, Functional classification of proteins by pattern discovery and top-down clustering of primary sequences, IBM SYST J, 40(2), 2001, pp. 379-393
Given a functionally heterogeneous set of proteins, such as a large superfamily or an entire database, two important problems in biology are the automated inference of subsets of functionally related proteins and the identification of functional regions and residues. The former is typically performed in an unsupervised bottom-up manner, by clustering based on pairwise sequence similarity. The latter is performed independently, in a supervised top-down manner starting from functional sets that have already been identified by either biological or computational means. Clearly, however, the two processes remain inextricably linked, because functional motifs and residues are related to corresponding functional clusters. This paper introduces a high-performance, top-down clustering technique and the corresponding system that determines functionally related clusters and functional motifs by coupling a pattern discovery algorithm, a statistical framework for the analysis of discovered patterns, and a motif refinement method based on hidden Markov models. Results are reported for the G protein-coupled receptor superfamily. These show that a significant majority of well-known functional sets and biologically relevant motifs are correctly recovered. They also show that a majority of the important functional residues reported in the literature occur in the inferred functional motifs. This technique has relevant implications for functional clustering and could be used as a highly predictive aid to mutagenesis experiments.