J. Wojcik et al., New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification, J MOL BIOL, 289(5), 1999, pp. 1469-1490
A bank of 13,563 loops from three to eight amino acid residues long,
representing motifs between two consecutive regular secondary structures,
has been derived from protein structures presenting less than 95% sequence
identity. Statistical analyses of occurrences of conformations and residues
revealed length-dependent over-representations of particular amino acids
(glycine, proline, asparagine, serine, and aspartate) and conformations
(alpha(L), epsilon, beta(P) regions of the Ramachandran plot). A
position-dependent distribution of these occurrences was observed for N- and
C-terminal residues, which are correlated to the nature of the flanking
regions. Loops of the
same length were clustered into statistically meaningful families on the
basis of their backbone structures when placed in a common reference frame,
independent of the flanks. These clusters present significantly different
distributions of sequence, conformations, and endpoint residue C-alpha
distances. On the basis of the sequence-structure correlation of this clustering,
an automatic loop modeling algorithm was developed. Based on the knowledge
of its sequence and of its flank backbone structures, each query loop is
assigned to a family, and target loop supports are selected in this family.
The support backbones of these target loops are then adjusted on flanking
structures by partial exploration of the conformational space. Loop closure is
performed by energy minimization for each support and the final model is
chosen among connected supports based upon energy criteria. The quality of
the prediction is evaluated by the root-mean-square deviation (rmsd) between
the final model and the native loops when the whole bank is re-attributed
on itself with a Jackknife test. This average rmsd ranges from 1.1 Angstrom
for three-residue loops to 3.8 Angstrom for eight-residue loops. A few
poorly predicted loops are inescapable, considering the high level of
diversity in loops and the lack of environment data. To overcome such
modeling problems, a statistical reliability score was assigned for each
prediction. This score is correlated to the quality of the prediction, in
terms of rmsd, and thus improves the selection accuracy of the model. The
algorithm efficiency was compared to CASP3 target loop predictions. Moreover, when tested on
a test loop bank, this algorithm was shown to be robust when the loops are
not precisely delimited, therefore proving to be a useful tool in practice
for protein modeling. © 1999 Academic Press.
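
The evaluation described in the abstract (an rmsd between each final model and its native loop, averaged per loop length) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that model and native coordinates are already expressed in the same reference frame, so no superposition is performed, and that each loop is given as a list of (x, y, z) tuples. The function names `rmsd` and `mean_rmsd_by_length` are hypothetical, as is the input layout.

```python
import math
from collections import defaultdict


def rmsd(model, native):
    """Root-mean-square deviation between two equal-length coordinate lists.

    Assumption: both lists hold (x, y, z) tuples for matching atoms and are
    already in a common frame; no fitting or superposition is done here.
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, native):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))


def mean_rmsd_by_length(predictions):
    """Average rmsd per loop length.

    `predictions` is an iterable of (loop_length, model_coords, native_coords)
    triples (a hypothetical layout). In a jackknife-style test, each model
    would be built with its own native loop left out of the bank.
    """
    buckets = defaultdict(list)
    for length, model, native in predictions:
        buckets[length].append(rmsd(model, native))
    return {length: sum(vals) / len(vals) for length, vals in buckets.items()}
```

Grouping by loop length mirrors the way the abstract reports one average per length (three to eight residues). Which atoms enter the sum, and whether the flanks are superposed first, are choices the abstract does not specify.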