Functional classification of proteins by pattern discovery and top-down clustering of primary sequences

Citation
A.H. Liu and A. Califano, Functional classification of proteins by pattern discovery and top-down clustering of primary sequences, IBM SYST J, 40(2), 2001, pp. 379-393
Number of citations
44
Subject Categories
Computer Science & Engineering
Journal title
IBM SYSTEMS JOURNAL
ISSN journal
0018-8670
Volume
40
Issue
2
Year of publication
2001
Pages
379 - 393
Database
ISI
SICI code
0018-8670(2001)40:2<379:FCOPBP>2.0.ZU;2-Y
Abstract
Given a functionally heterogeneous set of proteins, such as a large superfamily or an entire database, two important problems in biology are the automated inference of subsets of functionally related proteins and the identification of functional regions and residues. The former is typically performed in an unsupervised bottom-up manner, by clustering based on pair-wise sequence similarity. The latter is performed independently, in a supervised top-down manner starting from functional sets that have already been identified by either biological or computational means. Clearly, however, the two processes remain inextricably linked, because functional motifs and residues are related to corresponding functional clusters. This paper introduces a high-performance, top-down clustering technique and the corresponding system that determines functionally related clusters and functional motifs by coupling a pattern discovery algorithm, a statistical framework for the analysis of discovered patterns, and a motif refinement method based on hidden Markov models. Results are reported for the G protein-coupled receptor superfamily. These show that a significant majority of well-known functional sets and biologically relevant motifs are correctly recovered. They also show that a majority of the important functional residues reported in the literature occur in the inferred functional motifs. This technique has relevant implication in functional clustering and could be used as a highly predictive aid to mutagenesis experiments.