M. Linial et al., GLOBAL SELF-ORGANIZATION OF ALL KNOWN PROTEIN SEQUENCES REVEALS INHERENT BIOLOGICAL SIGNATURES, Journal of Molecular Biology, 268(2), 1997, pp. 539-556
A global classification of all currently known protein sequences is
performed. Every protein sequence is partitioned into segments of 50
amino acid residues and a dynamic programming distance is calculated
between each pair of segments. This space of segments is initially
embedded into Euclidean space. The algorithm that we apply embeds every
finite metric space into Euclidean space so that (1) the dimension of
the host space is small, (2) the metric distortion is small. A novel
self-organized, cross-validated clustering algorithm is then applied to
the embedded space with Euclidean distances. We monitor the validity of
our clustering by randomly splitting the data into two parts and
performing a hierarchical clustering algorithm independently on each
part. At every level of the hierarchy we cross-validate the clusters in
one part with the clusters in the other. The resulting hierarchical
tree of clusters offers a new representation of protein sequences and
families, which compares favorably with the most up-to-date
classifications based on functional and structural data about proteins.
Some of the known families clustered into well-separated clusters.
Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like
and others are automatically and correctly identified, and relations
between protein families are revealed by examining the splits along the
tree. This clustering leads to a novel representation of protein
families, from which functional biological kinship of protein families
can be deduced, as demonstrated for the transporter family. Finally, we
introduce a new concise representation for complete proteins that is
very useful in presenting multiple alignments, and in searching for
close relatives in the database. The self-organization method presented
is very general and applies to any data with a consistent and
computable measure of similarity between data items.
(C) 1997 Academic Press Limited.
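
The first two steps named in the abstract (cutting sequences into 50-residue segments and computing a dynamic programming distance between every pair of segments) can be illustrated with a minimal Python sketch. The abstract does not specify the scoring scheme or how a trailing short segment is handled, so the sketch makes explicit assumptions: segments are non-overlapping, a shorter final piece is kept, and plain Levenshtein edit distance stands in for the authors' distance. The function names are hypothetical and are not taken from the paper.

```python
from typing import List

SEGMENT_LENGTH = 50  # segment size stated in the abstract


def split_into_segments(sequence: str, length: int = SEGMENT_LENGTH) -> List[str]:
    """Cut a sequence into consecutive windows of `length` residues.

    Assumption: windows do not overlap and a shorter trailing piece is kept.
    """
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed by dynamic programming.

    Assumption: unit costs for substitution, insertion and deletion. This is a
    stand-in for the paper's distance, whose scoring is not given here.
    """
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def pairwise_distances(segments: List[str]) -> List[List[int]]:
    """Symmetric matrix of distances between all pairs of segments."""
    n = len(segments)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = edit_distance(segments[i], segments[j])
            matrix[i][j] = matrix[j][i] = distance
    return matrix
```

A matrix of this kind is the sort of input that the embedding and clustering stages described above would consume. Because the cost is quadratic in the number of segments, a sketch like this is only practical for small collections.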