ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space

Authors

Yona, G; Linial, N; Linial, M

Citation

G. Yona et al., ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space, PROTEINS, 37(3), 1999, pp. 360-378

Citations number

Subject Categories

Biochemistry & Biophysics

Journal title

PROTEINS-STRUCTURE FUNCTION AND GENETICS

ISSN journal

0887-3585

Volume

37

Issue

3

Year of publication

1999

Pages

360 - 378

Database

ISI

SICI code

0887-3585(19991115)37:3<360:PACOPS>2.0.ZU;2-4

Abstract

We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families. Proteins 1999;37:360-378. (C) 1999 Wiley-Liss, Inc.
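
The general idea in the abstract (start from the strongest pairs, then merge clusters through weaker pairs at progressively relaxed significance levels, while restricting transitivity) can be illustrated with a small sketch. This is not the algorithm of Yona et al.: the abstract does not state the actual merging criterion, so the rule used here (merge two clusters only when a minimum fraction of their cross pairs is significant) is an assumption made for illustration. The function name `cluster_hierarchy`, the parameter `min_link_fraction`, and the toy e-values are likewise hypothetical.

```python
from itertools import combinations


def cluster_hierarchy(names, evalue, thresholds, min_link_fraction=0.5):
    """Return one clustering (list of frozensets) per threshold.

    evalue: dict mapping frozenset({a, b}) to the pair's e-value
        (lower means more significant); missing pairs count as unrelated.
    thresholds: e-value cutoffs ordered from strictest to most permissive.
    min_link_fraction: hypothetical guard on transitivity. Two clusters
        merge only if at least this fraction of their cross pairs has an
        e-value at or below the current cutoff.
    """
    clusters = [frozenset([n]) for n in names]
    levels = []
    for cutoff in thresholds:
        merged = True
        while merged:  # repeat until no pair of clusters can be merged
            merged = False
            for a, b in combinations(clusters, 2):
                cross = [(x, y) for x in a for y in b]
                linked = sum(
                    1
                    for x, y in cross
                    if evalue.get(frozenset((x, y)), float("inf")) <= cutoff
                )
                if linked and linked / len(cross) >= min_link_fraction:
                    clusters = [c for c in clusters if c not in (a, b)]
                    clusters.append(a | b)
                    merged = True
                    break  # cluster list changed, restart the scan
        levels.append(list(clusters))
    return levels


# Toy data (made up): A-B is a very strong pair, B-C a weaker one.
names = ["A", "B", "C", "D"]
ev = {frozenset("AB"): 1e-100, frozenset("BC"): 1e-10}

levels = cluster_hierarchy(names, ev, thresholds=[1e-50, 1e-5])
strict = cluster_hierarchy(names, ev, thresholds=[1e-50, 1e-5],
                           min_link_fraction=1.0)
```

In this toy run the strict cutoff groups only A and B; relaxing the cutoff lets C join through its link to B, giving a two-level hierarchy. With `min_link_fraction=1.0` the same merge is refused because A and C are not directly related, which is one simple way to keep transitivity from chaining unrelated sequences together.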