Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases
A. Wallqvist et al., Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, BIOINFORMAT, 16(11), 2000, pp. 988-1002
Motivation: Sequence alignment techniques have been developed into
extremely powerful tools for identifying the folding families and function
of proteins in newly sequenced genomes. For a sufficiently low sequence
identity it is necessary to incorporate additional structural information
to positively detect homologous proteins. We have carried out an extensive
analysis of the effectiveness of incorporating secondary structure
information directly into the alignments for fold recognition and
identification of distant protein homologs. A secondary structure
similarity matrix based on a database of three-dimensionally aligned
proteins was first constructed. An iterative application of dynamic
programming was then used, incorporating linear combinations of amino acid
and secondary structure sequence similarity scores. Initially, only primary
sequence information is used. Subsequently, contributions from secondary
structure are phased in, and new homologous proteins are positively
identified if their scores are consistent with the predetermined error
rate.
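The combined scoring described above can be illustrated with a minimal sketch. It assumes a Smith-Waterman local alignment with a linear gap penalty and a per-position score of (1 - w) * S_aa + w * S_ss, where w is the secondary structure weight. The function names, the toy scoring values, the three-state (H, E, C) alphabet and the gap penalty are all hypothetical choices for illustration; they are not the matrices or parameters used in the paper.

```python
# Toy amino acid score: hypothetical identity scoring, not a real matrix.
def aa_score(a, b):
    return 2.0 if a == b else -1.0

# Toy three-state secondary structure scores (H = helix, E = strand,
# C = coil). The values are illustrative assumptions only.
SS_SCORE = {
    ("H", "H"): 2.0, ("E", "E"): 2.0, ("C", "C"): 1.0,
    ("H", "E"): -2.0, ("H", "C"): -1.0, ("E", "C"): -1.0,
}

def ss_score(x, y):
    # The table is symmetric, so look the pair up in either order.
    return SS_SCORE.get((x, y), SS_SCORE.get((y, x)))

def combined_local_score(seq_a, ss_a, seq_b, ss_b, w, gap=2.0):
    """Best local alignment score under a linear combination of scores.

    w = 0.0 uses amino acid similarity only; larger w phases in the
    secondary structure contribution. Gaps cost a flat `gap` per position.
    """
    assert len(seq_a) == len(ss_a) and len(seq_b) == len(ss_b)
    prev = [0.0] * (len(seq_b) + 1)
    best = 0.0
    for i in range(1, len(seq_a) + 1):
        curr = [0.0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            match = ((1.0 - w) * aa_score(seq_a[i - 1], seq_b[j - 1])
                     + w * ss_score(ss_a[i - 1], ss_b[j - 1]))
            curr[j] = max(0.0,                  # start a new local alignment
                          prev[j - 1] + match,  # align the two positions
                          prev[j] - gap,        # gap in seq_b
                          curr[j - 1] - gap)    # gap in seq_a
            best = max(best, curr[j])
        prev = curr
    return best
```

Calling the function repeatedly with increasing w (for example 0.0, then 0.25, then 0.5) mimics the phasing-in of secondary structure. In this toy setting, two sequences with no identical residues but matching helices score zero at w = 0.0 and a positive value at w = 0.5.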
Results: We used the SCOP40 database, where only PDB sequences that have
40% homology or less are included, to calibrate homology detection by the
combined amino acid and secondary structure sequence alignments. Combining
predicted secondary structure with sequence information results in an
8-15% increase in homology detection within SCOP40 relative to the pairwise
alignments using only amino acid sequence data at an error rate of 0.01
errors per query; a 35% increase is observed when the actual secondary
structure sequences are used. Incorporating predicted secondary structure
information in the analysis of six small genomes yields an improvement in
homology detection of approximately 20% over SSEARCH pairwise alignments,
but no improvement in the total number of homologs detected over PSI-BLAST,
at an error rate of 0.01 errors per query. However, because the pairwise
alignments based on combinations of amino acid and secondary structure
similarity are different from those produced by PSI-BLAST, and the error
rates can be calibrated, it is possible to combine the results of both
searches. An additional 25% relative improvement in the number of genes
identified at an error rate of 0.01 is observed when the data are pooled in
this way. Similarly, for the SCOP40 dataset, PSI-BLAST detected 15% of all
possible homologs, whereas the pooled results increased the total number of
homologs detected to 19%. These results are compared with recent reports of
homology detection using sequence profiling methods.
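One way to picture calibrating each search to an errors-per-query rate and then pooling the hits is sketched below. This is an assumption about the general idea, not the procedure from the paper: the function names, the data layout (labelled calibration hits, dictionaries keyed by query/target pair) and the simple set union are hypothetical. The sketch also does not re-estimate the error rate of the pooled set.

```python
def threshold_at_epq(hits, n_queries, max_epq=0.01):
    """Lowest score cutoff whose accepted hits stay within max_epq.

    hits: list of (score, is_true_homolog) pairs from a labelled
    calibration set; a higher score means a more confident hit.
    Returns None if even the top-scoring hit exceeds the error budget.
    Ties between scores are not treated specially in this sketch.
    """
    false_seen = 0
    cutoff = None
    for score, is_true in sorted(hits, key=lambda h: -h[0]):
        if not is_true:
            false_seen += 1
            if false_seen / n_queries > max_epq:
                break  # accepting this hit would exceed the error rate
        cutoff = score
    return cutoff

def pooled_detections(hits_a, hits_b, cutoff_a, cutoff_b):
    """Union of (query, target) pairs that pass either method's cutoff.

    hits_a, hits_b: dicts mapping (query, target) -> score, one per
    search method, each with its own calibrated cutoff.
    """
    accepted_a = {pair for pair, s in hits_a.items() if s >= cutoff_a}
    accepted_b = {pair for pair, s in hits_b.items() if s >= cutoff_b}
    return accepted_a | accepted_b
```

Because each method is thresholded on its own score scale, the scores never need to be compared directly; only the accepted pairs are merged.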