Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins

Citation
I. Rigoutsos et al., Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins, PROTEINS, 37(2), 1999, pp. 264-277
Number of citations
53
Subject categories
Biochemistry & Biophysics
Journal title
PROTEINS-STRUCTURE FUNCTION AND GENETICS
Journal ISSN
0887-3585
Volume
37
Issue
2
Year of publication
1999
Pages
264 - 277
Database
ISI
SICI code
0887-3585(19991101)37:2<264:DBVUHM>2.0.ZU;2-D
Abstract
Using TEIRESIAS, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be effectively used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for function or family equivalence memberships. Thus, seqlets can either define conserved family signatures or cut across molecular families and previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets also can capture structurally conserved motifs. The availability of a dictionary of seqlets that has been derived in such an unsupervised, hierarchical manner is generating new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and sensitive engines for homology searches, evolutionary studies, and protein structure prediction. Proteins 1999;37:264-277. (C) 1999 Wiley-Liss, Inc.