F. Eisenhaber et al., PREDICTION OF SECONDARY STRUCTURAL CONTENT OF PROTEINS FROM THEIR AMINO-ACID-COMPOSITION ALONE .1. NEW ANALYTIC VECTOR DECOMPOSITION METHODS, Proteins, 25(2), 1996, pp. 157-168
The predictive limits of the amino acid composition for the secondary
structural content (percentage of residues in the secondary structural
states helix, sheet, and coil) in proteins are assessed
quantitatively. For the first time, techniques for prediction of
secondary structural content are presented which rely on the amino
acid composition as the only information on the query protein. In our
first method, the amino acid composition of an unknown protein is
represented by the best (in a least-squares sense) linear combination
of the characteristic amino acid compositions of the three secondary
structural types computed from a learning set of tertiary structures.
The second technique is a generalization of the first one and also
takes into account possible compositional couplings between any two
sorts of amino acids. Its mathematical formulation results in an
eigenvalue/eigenvector problem of the
second moment matrix describing the amino acid compositional
fluctuations of secondary structural types in various proteins of a
learning set. Possible correlations of the principal directions of the
eigenspaces with physical properties of the amino acids were also
checked. For
example, the first two eigenvectors of the helical eigenspace
correlate with the size and hydrophobicity of the residue types,
respectively.
As learning and test sets of tertiary structures, we utilized
representative, automatically generated subsets of the Protein Data
Bank (PDB) consisting of non-homologous protein structures at the
resolution thresholds ≤ 1.8 Å, ≤ 2.0 Å, ≤ 2.5 Å, and ≤ 3.0 Å. We show
that the consideration of compositional couplings
improves prediction accuracy, albeit not dramatically. Whereas in the
self-consistency test (learning with the protein to be predicted), a
clear decrease of prediction accuracy with worsening resolution is
observed, the jackknife test (leave the predicted protein out) yielded
best results for the largest dataset (≤ 3.0 Å,
almost no difference to the self-consistency test!), i.e., only this
set, with more than 400 proteins, is sufficient for stable computation
of the parameters in the prediction function of the second method.
The average absolute errors in predicting the fractions of helix,
sheet, and coil from the amino acid composition of the query protein
are 13.7, 12.6, and 11.4%, respectively, with r.m.s. deviations in the
range of 8.6 to 11.8% for the 3.0 Å dataset in a jackknife test. The
absolute precision of the average absolute errors is in the range of
1 to 3% as measured for other representative subsets of the PDB.
Secondary structural content prediction methods found in the
literature have been clustered in accordance with their prediction
accuracies. To our surprise, much more complex secondary structure
prediction methods
utilized for the same purpose of secondary structural content
prediction achieve prediction accuracies very similar to those of the
present
analytic techniques, implying that all the information beyond the
amino acid composition is, in fact, mainly utilized for positioning
the secondary structural state in the sequence but not for
determination of
the overall number of residues in a secondary structural type. This
result implies that higher prediction accuracies cannot be achieved
relying solely on the amino acid composition of an unknown query
protein as prediction input. Our prediction program SSCP has been made
available as a World Wide Web and E-mail service.
(C) 1996 Wiley-Liss, Inc.
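
The first method described in the abstract (representing a query
composition as the best least-squares linear combination of three
characteristic compositions) can be illustrated with a minimal sketch.
It is not the SSCP program: the function names are hypothetical, the
4-letter compositions are made-up toy values (real input would have 20
amino acid types), and it assumes a plain unconstrained least-squares
fit solved through the normal equations, which may differ from the
exact formulation in the paper.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("characteristic compositions are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def predict_content(query, profiles):
    """Least-squares coefficients f minimizing |query - sum_k f[k] * profiles[k]|^2.

    Assumption: unconstrained fit; the coefficients are neither forced to
    be non-negative nor renormalized to sum to one.
    """
    gram = [[dot(p, q) for q in profiles] for p in profiles]
    rhs = [dot(p, query) for p in profiles]
    return solve_linear(gram, rhs)


# Made-up toy compositions over a 4-letter alphabet (illustration only).
helix = [0.40, 0.30, 0.20, 0.10]
sheet = [0.10, 0.20, 0.30, 0.40]
coil = [0.25, 0.10, 0.40, 0.25]

# A query built as an exact 50/20/30 mixture of the three profiles.
query = [0.5 * h + 0.2 * s + 0.3 * c for h, s, c in zip(helix, sheet, coil)]
fractions = predict_content(query, [helix, sheet, coil])
print(fractions)  # close to 0.5, 0.2, 0.3 up to rounding error
```

Because the toy query is an exact mixture, the fit recovers the mixing
fractions; for a real protein the fit leaves a residual, and the
resulting coefficients are only an estimate of the structural content.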