GLOBAL SELF-ORGANIZATION OF ALL KNOWN PROTEIN SEQUENCES REVEALS INHERENT BIOLOGICAL SIGNATURES

Citation
M. Linial et al., GLOBAL SELF-ORGANIZATION OF ALL KNOWN PROTEIN SEQUENCES REVEALS INHERENT BIOLOGICAL SIGNATURES, Journal of Molecular Biology, 268(2), 1997, pp. 539-556
Number of citations
35
Subject Categories
Biology
ISSN journal
00222836
Volume
268
Issue
2
Year of publication
1997
Pages
539 - 556
Database
ISI
SICI code
0022-2836(1997)268:2<539:GSOAKP>2.0.ZU;2-Y
Abstract
A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues and a dynamic programming distance is calculated between each pair of segments. This space of segments is initially embedded into Euclidean space. The algorithm that we apply embeds every finite metric space into Euclidean space so that (1) the dimension of the host space is small, and (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and performing a hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part with the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most up-to-date classifications based on functional and structural data about proteins. Some of the known families clustered into well-separated clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are automatically and correctly identified, and relations between protein families are revealed by examining the splits along the tree. This clustering leads to a novel representation of protein families, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful in presenting multiple alignments, and in searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items. (C) 1997 Academic Press Limited.
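The first step described in the abstract (cutting each sequence into 50-residue segments and computing a dynamic programming distance between every pair of segments) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say whether segments overlap or which scoring scheme the distance uses, so the sketch assumes non-overlapping windows and substitutes a plain unit-cost edit distance. The names split_into_segments, edit_distance and pairwise_distances are hypothetical.

```python
def split_into_segments(sequence, length=50):
    """Cut a sequence into consecutive windows of `length` residues.

    Assumption: windows do not overlap and a shorter trailing
    window is kept; the abstract does not specify either choice.
    """
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def edit_distance(a, b):
    """Unit-cost edit distance computed by dynamic programming.

    Assumption: this stands in for the paper's distance, which may
    well use a residue substitution matrix and gap penalties.
    """
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, residue_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, residue_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (residue_a != residue_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def pairwise_distances(segments):
    """Return the symmetric matrix of distances between all segments."""
    n = len(segments)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(segments[i], segments[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A distance matrix of this kind is what the later steps (embedding into Euclidean space, then clustering) would take as input. Unit-cost edit distance satisfies the metric axioms, which matches the abstract's requirement of a finite metric space.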