The rapid growth in the number of experimentally determined
three-dimensional protein structures has sharpened the need for
comprehensive and up-to-date surveys of known structures. Classic work
on protein structure classification has made it clear that a
structural survey is best carried out at the level of domains, i.e.,
substructures that recur
in evolution as functional units in different protein contexts. We
present a method for automated domain identification from protein
structure atomic coordinates based on quantitative measures of
compactness and, as the new element, recurrence. Compactness criteria
are used to recursively divide a protein into a series of successively
smaller substructures. Recurrence criteria are used to select an
optimal size level of these substructures, so that many of the chosen
substructures are common to different proteins at a high level of
statistical significance. The joint application of these criteria
automatically
yields consistent domain definitions between remote homologs, a result
difficult to achieve using compactness criteria alone. The method is
applied to a representative set of 1,137 sequence-unique protein
families covering 6,500 known structures. Clustering of the resulting
set of domains (substructures) yields 594 distinct fold classes (types
of substructures). The Dali Domain Dictionary
(http://www.embl-ebi.ac.uk/dali/) not only provides a global
structural classification, but also a
comprehensive description of families of protein sequences grouped
around representative proteins of known structure. The classification
will be continuously updated and can serve as a basis for improving
our
understanding of protein evolution and function and for evolving
optimal strategies to complete the map of all natural protein
structures. © 1998 Wiley-Liss, Inc.
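
The two-stage idea described above (recursive division by compactness, then choice of a size level by recurrence) can be illustrated with a minimal sketch. It is not the authors' implementation: the contact-density criterion, the 8 Å cutoff, the `max_score` threshold, and the function names are all assumptions made for illustration, and `recurrence` is a hypothetical user-supplied scoring function standing in for a real cross-protein structural comparison.

```python
from math import dist

def contacts(coords, a, b, cutoff=8.0):
    """Count residue pairs (i in a, j in b) closer than `cutoff` (assumed 8 A)."""
    return sum(1 for i in a for j in b if dist(coords[i], coords[j]) < cutoff)

def best_split(coords, residues, min_size=3):
    """Find the single sequence cut with the lowest cross-contact density.

    Returns (score, left, right), or None if the unit is too small to cut.
    This density is a simple stand-in for a compactness criterion.
    """
    best = None
    for k in range(min_size, len(residues) - min_size + 1):
        left, right = residues[:k], residues[k:]
        score = contacts(coords, left, right) / (len(left) * len(right))
        if best is None or score < best[0]:
            best = (score, left, right)
    return best

def split_tree(coords, residues, max_score=0.05, min_size=3):
    """Recursively divide a unit; each node is (residues, children)."""
    cut = best_split(coords, residues, min_size)
    if cut is None or cut[0] > max_score:
        return (residues, [])  # compact enough: stop dividing
    _, left, right = cut
    return (residues, [split_tree(coords, left, max_score, min_size),
                       split_tree(coords, right, max_score, min_size)])

def select_domains(node, recurrence):
    """Pick a size level: keep a unit whole unless its pieces recur more.

    `recurrence` is a hypothetical callable mapping a residue list to a
    score, e.g. how often a similar substructure is seen in other proteins.
    """
    residues, children = node
    if not children:
        return [residues]
    parts = [d for child in children for d in select_domains(child, recurrence)]
    whole = recurrence(residues)
    pieces = sum(recurrence(p) for p in parts) / len(parts)
    return parts if pieces > whole else [residues]

def blob(ox):
    """Six toy residues packed into a 2 A grid starting at x = ox."""
    return [(ox + 2.0 * (i % 2), 2.0 * ((i // 2) % 2), 2.0 * (i // 4))
            for i in range(6)]

# Toy chain: two compact blobs 50 A apart, residues 0-5 and 6-11.
coords = blob(0.0) + blob(50.0)
tree = split_tree(coords, list(range(12)))
```

In this toy chain the compactness step alone proposes cutting between residues 5 and 6. Whether that cut is kept depends on the recurrence score passed to `select_domains`: if the halves are scored as recurring more often than the whole, two domains are reported, otherwise one.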