A bank of loops three to eight amino acid residues long has been
compiled. On the basis of a statistical analysis of the occurrences of
conformations and residues, loops could be divided into two parts: the
side residues, directly bonded to the flanking secondary-structure
elements, and the inner part. The conformations of the side residues
are correlated with the nature of their neighboring flanks, whereas
the inner residues adopt conformations that are uncorrelated from one
residue to the next and are therefore unrelated to the flanks. Two
zones of the Ramachandran plot are important: alpha(L) and beta(p). In
particular, the high occurrence of alpha(L), mainly occupied by
glycine residues, is necessary to
induce flexibility and thus allow loops to comply with the geometrical
constraints of the flanks. A clustering algorithm has been used to
group loops of the same length into families of similar 3D
structures. At each position in each cluster, sequence and conformational
signatures have been deduced when the occurrence of a residue (or a
conformation) is higher than expected from an equiprobable
distribution over all clusters. As a result, some positions favor
particular amino acids and
conformations, which are typical of a cluster although not unique to
it. This indicates a relation between structure and sequence in
loops. A taxonomy is proposed that classifies the various clusters. It
relies on two terms: the mean distance between the first and last
C-alpha atoms in a cluster and, perpendicular to this line, the
distance to the
center of gravity of the cluster. It is noteworthy that the
differently populated clusters represented in such 2D plots can be
separated. Thus, although the conformations of loops in globular
proteins could cover a continuum, it has been possible to cluster them
into a limited number of well-populated families and superfamilies.
This basic feature
of protein architecture could be further exploited to better predict
loop geometry. (C) 1996 Academic Press Limited
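
The two terms of the taxonomy can be illustrated with a minimal sketch.
It is not the authors' implementation, and it rests on assumptions the
abstract does not state: each loop is given as a list of C-alpha
coordinates, the center of gravity is taken as the unweighted mean of
those coordinates, and both terms are computed per loop and then
averaged over the cluster. The function names are hypothetical.

```python
import math

def _sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)

def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def loop_descriptors(ca_coords):
    """Return (end_to_end, perpendicular) for one loop.

    end_to_end: distance between the first and last C-alpha.
    perpendicular: distance from the C-alpha centroid (assumed stand-in
    for the center of gravity) to the line through those two atoms.
    Assumes the first and last C-alpha do not coincide.
    """
    first, last = ca_coords[0], ca_coords[-1]
    axis = _sub(last, first)
    end_to_end = _norm(axis)
    n = len(ca_coords)
    centroid = tuple(sum(p[i] for p in ca_coords) / n for i in range(3))
    # Point-to-line distance: |axis x (centroid - first)| / |axis|
    perpendicular = _norm(_cross(axis, _sub(centroid, first))) / end_to_end
    return end_to_end, perpendicular

def cluster_descriptors(loops):
    """Average the two per-loop terms over all loops of one cluster."""
    pairs = [loop_descriptors(loop) for loop in loops]
    n = len(pairs)
    return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)

# Two made-up three-residue loops, coordinates in arbitrary units.
toy_cluster = [
    [(0.0, 0.0, 0.0), (2.0, 3.0, 0.0), (4.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 6.0, 0.0), (6.0, 0.0, 0.0)],
]
print(cluster_descriptors(toy_cluster))  # (5.0, 1.5)
```

Each cluster then maps to one point in a 2D plot of the kind described
above; the toy coordinates are invented for illustration and are not
data from the loop bank.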