PREDICTION OF SECONDARY STRUCTURAL CONTENT OF PROTEINS FROM THEIR AMINO-ACID-COMPOSITION ALONE .1. NEW ANALYTIC VECTOR DECOMPOSITION METHODS

Authors

EISENHABER F; IMPERIALE F; ARGOS P; FROMMEL C

Citation

F. Eisenhaber et al., PREDICTION OF SECONDARY STRUCTURAL CONTENT OF PROTEINS FROM THEIR AMINO-ACID-COMPOSITION ALONE .1. NEW ANALYTIC VECTOR DECOMPOSITION METHODS, Proteins, 25(2), 1996, pp. 157-168

Citations number

Subject categories

Biology

Journal title

Proteins

ISSN journal

08873585

Volume

25

Issue

2

Year of publication

1996

Pages

157 - 168

Database

ISI

SICI code

0887-3585(1996)25:2<157:POSSCO>2.0.ZU;2-T

Abstract

The predictive limits of the amino acid composition for the secondary structural content (percentage of residues in the secondary structural states helix, sheet, and coil) in proteins are assessed quantitatively. For the first time, techniques for prediction of secondary structural content are presented which rely on the amino acid composition as the only information on the query protein. In our first method, the amino acid composition of an unknown protein is represented by the best (in a least-squares sense) linear combination of the characteristic amino acid compositions of the three secondary structural types computed from a learning set of tertiary structures. The second technique is a generalization of the first one and also takes into account possible compositional couplings between any two sorts of amino acids. Its mathematical formulation results in an eigenvalue/eigenvector problem of the second moment matrix describing the amino acid compositional fluctuations of secondary structural types in various proteins of a learning set. Possible correlations of the principal directions of the eigenspaces with physical properties of the amino acids were also checked. For example, the first two eigenvectors of the helical eigenspace correlate with the size and hydrophobicity of the residue types, respectively. As learning and test sets of tertiary structures, we utilized representative, automatically generated subsets of the Protein Data Bank (PDB) consisting of non-homologous protein structures at the resolution thresholds ≤ 1.8 Å, ≤ 2.0 Å, ≤ 2.5 Å, and ≤ 3.0 Å. We show that the consideration of compositional couplings improves prediction accuracy, albeit not dramatically.
Whereas in the self-consistency test (learning with the protein to be predicted) a clear decrease of prediction accuracy with worsening resolution is observed, the jackknife test (leave the predicted protein out) yielded the best results for the largest dataset (≤ 3.0 Å, almost no difference to the self-consistency test!), i.e., only this set, with more than 400 proteins, is sufficient for stable computation of the parameters in the prediction function of the second method. The average absolute errors in predicting the fraction of helix, sheet, and coil from the amino acid composition of the query protein are 13.7, 12.6, and 11.4%, respectively, with r.m.s. deviations in the range of 8.6 to 11.8% for the 3.0 Å dataset in a jackknife test. The absolute precision of the average absolute errors is in the range of 1 to 3% as measured for other representative subsets of the PDB. Secondary structural content prediction methods found in the literature have been clustered in accordance with their prediction accuracies. To our surprise, much more complex secondary structure prediction methods utilized for the same purpose of secondary structural content prediction achieve prediction accuracies very similar to those of the present analytic techniques, implying that all the information beyond the amino acid composition is, in fact, mainly utilized for positioning the secondary structural state in the sequence but not for determination of the overall number of residues in a secondary structural type. This result implies that higher prediction accuracies cannot be achieved relying solely on the amino acid composition of an unknown query protein as prediction input. Our prediction program SSCP has been made available as a World Wide Web and E-mail service. © 1996 Wiley-Liss, Inc.
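
Illustrative note (not part of the original record): the first method described in the abstract, fitting the query composition by a least-squares linear combination of three characteristic compositions, might be sketched as follows. This is a minimal sketch under stated assumptions, not the SSCP implementation: the function name `predict_content` is hypothetical, the characteristic compositions are random placeholders rather than statistics from a PDB learning set, and the fit is left unconstrained (whether the published method constrains the coefficients is not stated in the abstract).

```python
import numpy as np

def predict_content(query_comp, class_comps):
    """Least-squares fit of a composition vector (hypothetical helper).

    query_comp  : shape (20,), amino acid fractions of the query protein
    class_comps : shape (20, 3), columns are the characteristic compositions
                  of helix, sheet, and coil (would come from a learning set)
    Returns the three fitted coefficients, read here as content fractions.
    Assumption: plain unconstrained least squares.
    """
    coeffs, *_ = np.linalg.lstsq(class_comps, query_comp, rcond=None)
    return coeffs

rng = np.random.default_rng(0)

# Placeholder characteristic compositions: random draws, NOT real statistics.
class_comps = rng.dirichlet(np.ones(20), size=3).T  # shape (20, 3)

# Synthetic query: an exact mixture of the three placeholder compositions.
true_content = np.array([0.5, 0.3, 0.2])
query_comp = class_comps @ true_content

estimated = predict_content(query_comp, class_comps)
print("estimated helix/sheet/coil fractions:", estimated)
```

Because the synthetic query is an exact mixture, the fit recovers the mixing weights up to floating-point error; a real protein composition would leave a nonzero residual, which is where the prediction errors quoted in the abstract come from.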