New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification

Citation
J. Wojcik et al., New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification, J MOL BIOL, 289(5), 1999, pp. 1469-1490
Number of citations
71
Subject categories
Molecular Biology & Genetics
Journal title
JOURNAL OF MOLECULAR BIOLOGY
ISSN journal
0022-2836
Volume
289
Issue
5
Year of publication
1999
Pages
1469 - 1490
Database
ISI
SICI code
0022-2836(19990625)289:5<1469:NESSSP>2.0.ZU;2-#
Abstract
A bank of 13,563 loops from three to eight amino acid residues long, representing motifs between two consecutive regular secondary structures, has been derived from protein structures presenting less than 95% sequence identity. Statistical analyses of occurrences of conformations and residues revealed length-dependent over-representations of particular amino acids (glycine, proline, asparagine, serine, and aspartate) and conformations (alpha(L), epsilon, beta(P) regions of the Ramachandran plot). A position-dependent distribution of these occurrences was observed for N- and C-terminal residues, which are correlated to the nature of the flanking regions. Loops of the same length were clustered into statistically meaningful families on the basis of their backbone structures when placed in a common reference frame, independent of the flanks. These clusters present significantly different distributions of sequence, conformations, and endpoint residue C-alpha distances. On the basis of the sequence-structure correlation of this clustering, an automatic loop modeling algorithm was developed. Based on the knowledge of its sequence and of its flank backbone structures, each query loop is assigned to a family and target loop supports are selected in this family. The support backbones of these target loops are then adjusted on flanking structures by partial exploration of the conformational space. Loop closure is performed by energy minimization for each support and the final model is chosen among connected supports based upon energy criteria. The quality of the prediction is evaluated by the root-mean-square deviation (rmsd) between the final model and the native loops when the whole bank is re-attributed on itself with a jackknife test. This average rmsd ranges from 1.1 Å for three-residue loops to 3.8 Å for eight-residue loops.
A few poorly predicted loops are inescapable, considering the high level of diversity in loops and the lack of environment data. To overcome such modeling problems, a statistical reliability score was assigned for each prediction. This score is correlated to the quality of the prediction, in terms of rmsd, and thus improves the selection accuracy of the model. The algorithm efficiency was compared to CASP3 target loop predictions. Moreover, when tested on a test loop bank, this algorithm was shown to be robust when the loops are not precisely delimited, therefore proving to be a useful tool in practice for protein modeling. © 1999 Academic Press.
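Illustrative note
The abstract evaluates predictions by the rmsd between each final model and its native loop, averaged per loop length. The following is a minimal sketch of that metric only, not the authors' implementation. It assumes that the model and native backbone coordinates are already expressed in the same reference frame (no superposition is performed here), and the names rmsd and mean_rmsd_by_length are hypothetical.

```python
import math


def rmsd(model, native):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumption: both coordinate sets are already in a common reference frame,
    so no fitting or superposition step is applied.
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, native):
        squared_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared_sum / len(model))


def mean_rmsd_by_length(pairs):
    """Average the rmsd of (model, native) pairs, grouped by number of points.

    Assumption: the number of points stands in for the loop length.
    """
    totals = {}
    for model, native in pairs:
        length = len(native)
        running_sum, count = totals.get(length, (0.0, 0))
        totals[length] = (running_sum + rmsd(model, native), count + 1)
    return {length: s / c for length, (s, c) in totals.items()}
```

In a leave-one-out (jackknife) evaluation, each pair would hold a loop predicted without access to its own entry in the bank; how that prediction is produced is outside the scope of this sketch.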