ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space

Citation
G. Yona et al., ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space, PROTEINS, 37(3), 1999, pp. 360-378
Number of citations
54
Subject categories
Biochemistry & Biophysics
Journal title
PROTEINS-STRUCTURE FUNCTION AND GENETICS
ISSN journal
0887-3585
Volume
37
Issue
3
Year of publication
1999
Pages
360 - 378
Database
ISI
SICI code
0887-3585(19991115)37:3<360:PACOPS>2.0.ZU;2-4
Abstract
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families. Proteins 1999;37:360-378. © 1999 Wiley-Liss, Inc.
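The clustering idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the merging criterion, so the rule used here (merge two clusters only if a minimum fraction of their cross pairs pass the current threshold) is an assumption, as are the names `cluster_hierarchy`, `min_fraction`, and `example_pairs` and the use of e-values as the similarity measure. The first, strictest threshold yields a conservative initial classification by transitive closure over the strongest pairs; each later, more permissive threshold merges existing clusters under the restrictive rule, producing one level of the hierarchy per threshold.

```python
from collections import defaultdict


def cluster_hierarchy(pairs, thresholds, min_fraction=0.5):
    """Illustrative multi-level clustering (hypothetical, not the published method).

    pairs: dict mapping (protein_a, protein_b) -> e-value (lower = more similar).
    thresholds: e-value cutoffs ordered from strict to permissive.
    min_fraction: assumed restriction on transitivity; two clusters merge only
        if at least this fraction of their possible cross pairs pass the cutoff.
    Returns a list of (threshold, clusters), one entry per level.
    """
    proteins = sorted({p for pair in pairs for p in pair})
    parent = {p: p for p in proteins}  # union-find forest

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def current_clusters():
        groups = defaultdict(list)
        for p in proteins:
            groups[find(p)].append(p)
        return groups

    levels = []
    for t in thresholds:
        groups = current_clusters()  # clusters as they stand when the level starts
        links = defaultdict(int)     # passing pairs between each pair of clusters
        for (a, b), evalue in pairs.items():
            ra, rb = find(a), find(b)
            if ra != rb and evalue <= t:
                links[tuple(sorted((ra, rb)))] += 1
        for (ra, rb), n in links.items():
            possible = len(groups[ra]) * len(groups[rb])
            if n / possible >= min_fraction:
                parent[find(ra)] = find(rb)  # merge the two clusters
        clusters = sorted(sorted(g) for g in current_clusters().values())
        levels.append((t, clusters))
    return levels


# Made-up toy data: two tight groups joined by a single weak pair (C, D).
example_pairs = {
    ("A", "B"): 1e-50,
    ("B", "C"): 1e-40,
    ("D", "E"): 1e-60,
    ("C", "D"): 1e-5,
}

for threshold, clusters in cluster_hierarchy(example_pairs, [1e-30, 1e-3]):
    print(threshold, clusters)
```

In the toy data, the single weak pair between the two groups covers only one of six possible cross pairs, so with the default `min_fraction` the groups stay separate at the permissive level; lowering `min_fraction` lets them merge. The link counts are taken once per level against the clusters present at its start, which is a simplification chosen to keep the sketch short.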