D.W. Rice and D. Eisenberg, A 3D-1D SUBSTITUTION MATRIX FOR PROTEIN FOLD RECOGNITION THAT INCLUDES PREDICTED SECONDARY STRUCTURE OF THE SEQUENCE, Journal of Molecular Biology, 267(4), 1997, pp. 1026-1038
In protein fold recognition, a probe amino acid sequence is compared to
a library of representative folds of known structure to identify a
structural homolog. In cases where the probe and its homolog have clear
sequence similarity, traditional residue substitution matrices have been
used to predict the structural similarity. For cases where the probe is
distant in sequence from its homolog, we have developed a
(7 x 3 x 2 x 7 x 3) 3D-1D substitution matrix (called H3P2), calculated
from a database of 119 structural pairs. Members of each pair share a
similar fold but have less than 30% sequence identity. Each probe
sequence position is defined by one of seven residue classes and three
secondary structure classes. Each homologous fold position is defined by
one of seven residue classes, three secondary structure classes, and two
burial classes. Thus the matrix is five-dimensional and contains
7 x 3 x 2 x 7 x 3 = 882 elements, or 3D-1D scores. The first step in
assigning a probe sequence to its homologous fold is the prediction of
the three-state (helix, strand, coil) secondary structure of the probe;
here we use the profile-based neural network prediction of secondary
structure (PHD) program. Then a dynamic programming algorithm uses the
H3P2 matrix to align the probe sequence with structures in a
representative fold library. To test the effectiveness of the H3P2
matrix, a challenging, fold-class-diverse, and cross-validated benchmark
assessment is used to compare the H3P2 matrix to the GONNET, PAM250, and
BLOSUM62 matrices and to a secondary-structure-only substitution matrix.
For distantly related sequences, the H3P2 matrix detects more homologous
structures at higher reliabilities than do these other substitution
matrices, based on sensitivity versus specificity plots (SENS-SPEC
plots). The added efficacy of the H3P2 matrix arises from its
information on the statistical preferences for various
sequence-structure environment combinations from very distantly related
proteins. It introduces the predicted secondary structure information
from a sequence into fold recognition in a statistical way that
normalizes the inherent correlations between residue type, secondary
structure and solvent accessibility. (C) 1997 Academic Press Limited.
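
The five-dimensional layout and the alignment step described in the abstract can be illustrated with a minimal Python sketch. Only the class counts (7, 3, 2, 7, 3) and the total of 882 scores come from the abstract. The axis order, the function names (flat_index, placeholder_matrix, align_score), the random placeholder scores, the choice of a global (Needleman-Wunsch style) alignment, and the linear gap penalty are all assumptions made for illustration; the abstract says only that "a dynamic programming algorithm" is used and gives none of these details.

```python
import math
import random

# Class counts taken from the abstract; the axis order is an assumption:
# (fold residue class, fold secondary structure, fold burial,
#  probe residue class, probe predicted secondary structure)
N_RES, N_SS, N_BURIAL = 7, 3, 2
DIMS = (N_RES, N_SS, N_BURIAL, N_RES, N_SS)


def flat_index(fold_pos, probe_pos):
    """Map one fold position and one probe position to an offset in 0..881.

    fold_pos  = (residue_class, ss_class, burial_class)
    probe_pos = (residue_class, predicted_ss_class)
    """
    idx = 0
    for value, size in zip(fold_pos + probe_pos, DIMS):
        if not 0 <= value < size:
            raise ValueError("class index out of range")
        idx = idx * size + value  # row-major flattening of the 5 axes
    return idx


def placeholder_matrix(seed=0):
    """Return 882 random placeholder scores (NOT the published H3P2 values)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(math.prod(DIMS))]


def align_score(matrix, fold, probe, gap=-1.0):
    """Global alignment score of a fold against a probe.

    Hypothetical choices: Needleman-Wunsch recursion with a linear gap
    penalty; the abstract does not specify the variant or the penalties.
    """
    n, m = len(fold), len(probe)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # fold positions aligned to gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # probe positions aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + matrix[flat_index(fold[i - 1], probe[j - 1])]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


if __name__ == "__main__":
    h3p2_like = placeholder_matrix()
    fold = [(0, 0, 0), (1, 1, 1), (6, 2, 0)]   # made-up fold positions
    probe = [(0, 0), (1, 1), (6, 2)]           # made-up probe positions
    print(len(h3p2_like), align_score(h3p2_like, fold, probe))
```

In a fold-recognition setting, a score of this kind would be computed for the probe against every structure in the fold library and the library entries ranked by it; the probe's secondary structure classes would come from a predictor such as PHD rather than being written by hand as they are here.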