Motivation: A new hidden Markov model method (SAM-T98) for finding remote
homologs of protein sequences is described and evaluated. The method begins
with a single target sequence and iteratively builds a hidden Markov model
(HMM) from the sequence and homologs found using the HMM for database
search. SAM-T98 is also used to construct model libraries automatically from
sequences in structural databases.
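The iterative loop described above can be illustrated with a deliberately simplified sketch. It is not the SAM-T98 implementation: the full HMM is replaced here by an ungapped, column-wise profile over equal-length sequences, hits are scored against a uniform background, and the names `build_profile`, `log_odds`, `iterative_search`, the pseudocount, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seqs, pseudocount=1.0):
    """Column-wise residue probabilities with additive pseudocounts.

    Toy stand-in for HMM training: assumes all sequences have equal length
    and models no insertions or deletions.
    """
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def log_odds(profile, seq):
    """Log-odds of seq under the profile versus a uniform background."""
    background = 1.0 / len(ALPHABET)
    return sum(math.log(col[res] / background) for col, res in zip(profile, seq))


def iterative_search(target, database, threshold=1.5, max_rounds=4):
    """Start from one target; rebuild the model each round from all hits so far."""
    members = [target]
    for _ in range(max_rounds):
        profile = build_profile(members)
        hits = [s for s in database
                if s not in members and log_odds(profile, s) >= threshold]
        if not hits:  # converged: the current model finds nothing new
            break
        members.extend(hits)
    return members


found = iterative_search("ACDE", ["ACDF", "ACGF", "WWWW"])
```

In this toy run, `ACGF` falls below the threshold when scored against the model built from the target alone, but is accepted once `ACDF` has been added to the training set. That is the effect the iteration is meant to capture: intermediate homologs widen the model so that more remote ones become detectable.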
Methods: We evaluate the SAM-T98 method with four datasets. Three of the
test sets are fold-recognition tests, where the correct answers are
determined by structural similarity. The fourth uses a curated database. The
method is compared against WU-BLASTP and against DOUBLE-BLAST, a two-step
method similar to ISS, but using BLAST instead of FASTA.
Results: SAM-T98 had the fewest errors in all tests, dramatically so for the
fold-recognition tests. At the minimum-error point on the SCOP (Structural
Classification of Proteins)-domains test, SAM-T98 got 880 true positives and
68 false positives, DOUBLE-BLAST got 533 true positives with 71 false
positives, and WU-BLASTP got 353 true positives with 24 false positives. The
method is optimized to recognize superfamilies, and would require parameter
adjustment to be used to find family or fold relationships. One key to the
performance of the HMM method is a new score-normalization technique that
compares the score to the score with a reversed model rather than to a
uniform null model.
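The difference between the two normalizations can be made concrete with a minimal sketch. It is an illustration under strong assumptions, not the SAM-T98 scoring code: the model is reduced to an ungapped profile (one residue distribution per column) over a two-letter alphabet, and the function names `log_prob`, `uniform_null_score`, and `reverse_null_score` are hypothetical.

```python
import math


def log_prob(profile, seq):
    """Log-probability of seq under an ungapped profile (one dict per column)."""
    if len(profile) != len(seq):
        raise ValueError("toy profile requires len(seq) == number of columns")
    return sum(math.log(col[res]) for col, res in zip(profile, seq))


def uniform_null_score(profile, seq, alphabet_size):
    """Log-odds against a null model that emits every residue with equal probability."""
    return log_prob(profile, seq) - len(seq) * math.log(1.0 / alphabet_size)


def reverse_null_score(profile, seq):
    """Log-odds against the same model with its columns in reverse order."""
    return log_prob(profile, seq) - log_prob(profile[::-1], seq)


# Toy A-rich profile over the alphabet {A, C}; values are illustrative only.
profile = [
    {"A": 0.9, "C": 0.1},
    {"A": 0.9, "C": 0.1},
    {"A": 0.6, "C": 0.4},
]
```

With this toy profile, the homopolymer `AAA` scores about 1.36 against the uniform null simply because the model is A-rich, but exactly 0 against the reversed model, since the reversed model shares the same composition. `AAC`, whose residue order matches the profile, keeps a positive reversed-null score of ln 6. One plausible reading of the technique, offered here as an interpretation rather than a claim from the text, is that the reversed model cancels score contributions that come from composition alone.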