G. Yona et al., ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space, PROTEINS, 37(3), 1999, pp. 360-378
We investigate the space of all protein sequences in search of clusters of
related proteins. Our aim is to automatically detect these sets, and thus
obtain a classification of all protein sequences. Our analysis, which uses
standard measures of sequence similarity as applied to an all-vs.-all
comparison of SWISSPROT, gives a very conservative initial classification
based on the highest scoring pairs. The many classes in this classification
correspond to protein subfamilies. Subsequently we merge the subclasses
using the weaker pairs in a two-phase clustering algorithm. The algorithm
makes use of transitivity to identify homologous proteins; however,
transitivity is applied restrictively in an attempt to prevent unrelated
proteins from clustering together. This process is repeated at varying
levels of statistical significance. Consequently, a hierarchical
organization of all proteins is obtained.
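The merging idea described above can be illustrated with a minimal sketch. It is not the published ProtoMap algorithm: the function names, the `min_fraction` rule used to restrict transitivity, and the example e-value thresholds are all assumptions made for illustration. Two clusters are merged only if a sufficient fraction of their cross-cluster pairs is significant at the current threshold, and the procedure is repeated at progressively weaker thresholds to produce a hierarchy.

```python
from itertools import combinations


def cluster_at_threshold(clusters, scores, threshold, min_fraction=0.5):
    """Merge clusters that are well connected at the given threshold.

    clusters: iterable of sets of protein ids
    scores: dict mapping frozenset({a, b}) to an e-value (lower is stronger)
    min_fraction: hypothetical restriction on transitivity; the share of
        cross-cluster pairs that must score at or below the threshold
    """
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            linked = sum(
                1
                for x in a
                for y in b
                if scores.get(frozenset((x, y)), float("inf")) <= threshold
            )
            if linked / (len(a) * len(b)) >= min_fraction:
                clusters[i] = a | b
                del clusters[j]
                merged = True
                break  # cluster list changed, so restart the pair scan
    return [frozenset(c) for c in clusters]


def build_hierarchy(proteins, scores, thresholds, min_fraction=0.5):
    """Apply the merge step at each threshold, strictest first."""
    level = [frozenset([p]) for p in proteins]
    levels = []
    for t in thresholds:
        level = cluster_at_threshold(level, scores, t, min_fraction)
        levels.append((t, level))
    return levels


# Toy data (invented): D is weakly similar to C only.
toy_scores = {
    frozenset(("A", "B")): 1e-100,
    frozenset(("A", "C")): 1e-10,
    frozenset(("B", "C")): 1e-10,
    frozenset(("C", "D")): 1e-3,
}
toy_levels = build_hierarchy("ABCD", toy_scores, [1e-50, 1e-5, 1e-1])
```

In this toy example, D is linked to only one of the three members of the A, B, C cluster, so it stays separate even at the weakest threshold. Lowering `min_fraction` moves the behavior toward unrestricted transitivity.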
The resulting classification splits the protein space into well-defined
groups of proteins, which are closely correlated with natural biological
families and superfamilies. Different indices of validity were applied to
assess the quality of our classification and compare it with the protein
families in the PROSITE and Pfam databases. Our classification agrees with
these domain-based classifications for between 64.8% and 88.5% of the
proteins. It also finds many new clusters of protein sequences which were
not classified by these databases. The hierarchical organization suggested
by our analysis reveals finer subfamilies in families of known proteins as
well as many novel relations between protein families.
Proteins 1999;37:360-378. (C) 1999 Wiley-Liss, Inc.