Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases

Citation
A. Wallqvist et al., Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, BIOINFORMAT, 16(11), 2000, pp. 988-1002
Number of citations
77
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
16
Issue
11
Year of publication
2000
Pages
988 - 1002
Database
ISI
SICI code
1367-4803(200011)16:11<988:ISSSFP>2.0.ZU;2-Q
Abstract
Motivation: Sequence alignment techniques have been developed into extremely powerful tools for identifying the folding families and function of proteins in newly sequenced genomes. For a sufficiently low sequence identity it is necessary to incorporate additional structural information to positively detect homologous proteins. We have carried out an extensive analysis of the effectiveness of incorporating secondary structure information directly into the alignments for fold recognition and identification of distant protein homologs. A secondary structure similarity matrix based on a database of three-dimensionally aligned proteins was first constructed. An iterative application of dynamic programming was used which incorporates linear combinations of amino acid and secondary structure sequence similarity scores. Initially, only primary sequence information is used. Subsequently, contributions from secondary structure are phased in and new homologous proteins are positively identified if their scores are consistent with the predetermined error rate.
Results: We used the SCOP40 database, where only PDB sequences that have 40% homology or less are included, to calibrate homology detection by the combined amino acid and secondary structure sequence alignments. Combining predicted secondary structure with sequence information results in an 8-15% increase in homology detection within SCOP40 relative to the pairwise alignments using only amino acid sequence data at an error rate of 0.01 errors per query; a 35% increase is observed when the actual secondary structure sequences are used. Incorporating predicted secondary structure information in the analysis of six small genomes yields an improvement in homology detection of approximately 20% over SSEARCH pairwise alignments, but no improvement in the total number of homologs detected over PSI-BLAST, at an error rate of 0.01 errors per query.
However, because the pairwise alignments based on combinations of amino acid and secondary structure similarity are different from those produced by PSI-BLAST and the error rates can be calibrated, it is possible to combine the results of both searches. An additional 25% relative improvement in the number of genes identified at an error rate of 0.01 is observed when the data is pooled in this way. Similarly, for the SCOP40 dataset, PSI-BLAST detected 15% of all possible homologs, whereas the pooled results increased the total number of homologs detected to 19%. These results are compared with recent reports of homology detection using sequence profiling methods.
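Illustrative note: the abstract describes dynamic programming over a linear combination of amino acid and secondary structure similarity scores, with the secondary structure contribution phased in after an initial sequence-only pass. The Python sketch below is one possible reading of that idea, not the authors' implementation. The function names, the toy match/mismatch scores, the linear gap penalty, the weight parameter `w`, and the choice of Smith-Waterman local alignment are all assumptions made for illustration; the paper uses a similarity matrix derived from three-dimensionally aligned proteins, which is not reproduced here.

```python
def aa_score(a: str, b: str) -> float:
    """Toy amino acid similarity (assumption): +2 for identity, -1 otherwise."""
    return 2.0 if a == b else -1.0


def ss_score(x: str, y: str) -> float:
    """Toy secondary structure similarity (assumption): +1 if states match, else -1."""
    return 1.0 if x == y else -1.0


def combined_local_score(seq1: str, ss1: str, seq2: str, ss2: str,
                         w: float, gap: float = 2.0) -> float:
    """Best Smith-Waterman local alignment score using the per-position score
    (1 - w) * aa_score + w * ss_score, with a linear gap penalty (assumption)."""
    if len(seq1) != len(ss1) or len(seq2) != len(ss2):
        raise ValueError("each sequence needs one secondary structure state per residue")
    prev = [0.0] * (len(seq2) + 1)
    best = 0.0
    for i in range(1, len(seq1) + 1):
        curr = [0.0] * (len(seq2) + 1)
        for j in range(1, len(seq2) + 1):
            s = ((1.0 - w) * aa_score(seq1[i - 1], seq2[j - 1])
                 + w * ss_score(ss1[i - 1], ss2[j - 1]))
            curr[j] = max(0.0,              # start a new local alignment
                          prev[j - 1] + s,  # align residue i with residue j
                          prev[j] - gap,    # gap in seq2
                          curr[j - 1] - gap)  # gap in seq1
            best = max(best, curr[j])
        prev = curr
    return best


# Phase in the secondary structure contribution: w = 0 uses sequence only.
for w in (0.0, 0.25, 0.5):
    score = combined_local_score("ACDE", "HHHH", "ACDE", "HHHH", w)
    print(f"w={w:.2f} score={score:.2f}")
```

In a full search, each score would additionally be compared against a threshold calibrated to the chosen error rate (0.01 errors per query in the abstract); that calibration step is not sketched here.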