I. Rigoutsos et al., Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins, PROTEINS, 37(2), 1999, pp. 264-277
Using TEIRESIAS, a pattern discovery method that identifies all motifs
present in any given set of protein sequences without requiring alignment
or explicit enumeration of the solution space, we have explored the GenPept
sequence database and built a dictionary of all sequence patterns with two
or more instances. The entries of this dictionary, henceforth named
seqlets, cover 98.12% of all amino acid positions in the input database and
in essence
provide a comprehensive finite set of descriptors for protein sequence
space. As such, seqlets can be effectively used to describe almost every
naturally occurring protein. In fact, seqlets can be thought of as building
blocks of protein molecules that are a necessary (but not sufficient)
condition
for function or family equivalence memberships. Thus, seqlets can either
define conserved family signatures or cut across molecular families and
previously undetected sequence signals deriving from functional
convergence. Moreover, we show that seqlets also can capture structurally
conserved motifs. The availability of a dictionary of seqlets that has been
derived in such
an unsupervised, hierarchical manner is generating new opportunities for
addressing problems that range from reliable classification and the
correlation of sequence fragments with functional categories to faster and
more sensitive engines for homology searches, evolutionary studies, and
protein structure prediction. Proteins 1999;37:264-277. (C) 1999
Wiley-Liss, Inc.
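
The coverage figure quoted in the abstract (the fraction of amino acid
positions that fall inside at least one pattern with two or more instances)
can be illustrated with a small sketch. This is not TEIRESIAS and does not
discover patterns; it only scores a given pattern list. The function names,
the toy sequences, the use of "." as a wildcard, and the choice to count
wildcard positions as covered are all assumptions made for illustration.

```python
import re

def find_instances(pattern, sequences):
    """Return (sequence index, start, end) for every match, overlaps included.

    Assumption: patterns are written as regular expressions in which "."
    stands for a wildcard position.
    """
    # A zero-width lookahead lets finditer report overlapping matches.
    regex = re.compile("(?=(" + pattern + "))")
    return [
        (i, m.start(1), m.end(1))
        for i, seq in enumerate(sequences)
        for m in regex.finditer(seq)
    ]

def coverage(patterns, sequences, min_instances=2):
    """Fraction of positions covered by patterns with >= min_instances matches."""
    covered = [set() for _ in sequences]
    for pattern in patterns:
        hits = find_instances(pattern, sequences)
        if len(hits) < min_instances:
            continue  # patterns with too few instances are not dictionary entries
        for i, start, end in hits:
            # Assumption: every position spanned by a match counts as covered.
            covered[i].update(range(start, end))
    total = sum(len(seq) for seq in sequences)
    return sum(len(c) for c in covered) / total if total else 0.0

# Hypothetical toy data, not taken from GenPept.
TOY_SEQUENCES = ["MKVLA", "GKVLT", "PPPP"]
TOY_PATTERNS = ["K.L", "PP", "MK"]
```

Calling `coverage(TOY_PATTERNS, TOY_SEQUENCES)` skips "MK", which has only
one instance, and counts the positions spanned by the remaining two
patterns.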