Iterated sequence databank search methods were assessed from the viewpoint
of someone with the sequence of a novel gene product wishing to find
distant relatives to their protein and, with the specific searches against
the PDB, also hoping to find a relative of known structure. We examined
three methods in detail, spanning a range from simple pattern-matching to
sophisticated weighted profiles. Rather than apply these methods 'blindly'
(with default parameters) to a large number of test queries, we have
concentrated on
the globins, so allowing a more detailed investigation of each method on
different data subsets with different parameter settings. Despite its
widespread use, regular-expression matching proved to be very limited,
seldom extending beyond the sub-family from which the pattern was derived.
To attain
any generality, the patterns had to be 'stripped-down' to include only the
most highly conserved parts. The QUEST program avoided these problems by
introducing a more flexible (weighted) matching. On the PDB sequences this
was highly effective, missing only a few globins with probes based on each
sub-family or even a single representative from each sub-family. In addition,
very few false-positives were encountered, and those that did match often
did so only for a few cycles before being lost again. On the larger
sequence collection, however, QUEST encountered problems with maintaining
(or achieving) the alignment of the full globin family. Psi-BLAST also
recognised
almost all the globins when matching against the PDB sequences, typically
missing three or four of the most distantly related sequences while picking
up a few false-positives. In contrast to QUEST, Psi-BLAST performed very
well on the larger databank, getting almost a full collection of globins
although still retaining the same proportion of false-positives. SAM
applied to the PDB sequences performed reasonably well with the myoglobin
and hemoglobin families as probes, typically missing several of the more
difficult proteins, but performed poorly with the leghemoglobin probe. Only
with the full family range as a probe did it produce results comparable to
Psi-BLAST and QUEST. With the larger databank, SAM produced a good result
but, again,
this was only achieved using the full range of sequence variation with the
default regulariser; the use of Dirichlet mixtures failed completely in this
situation. (C) 1999 Elsevier Science Ltd. All rights reserved.
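
The contrast drawn above between strict regular-expression matching and
weighted matching can be illustrated with a minimal sketch. Everything in
it is an assumption made for illustration: the five-residue sequences are
invented (they are not globins), the function names are hypothetical, and
the log-odds scoring with a uniform background and a flat pseudocount is a
generic position-specific scheme, not the scoring actually used by QUEST,
Psi-BLAST or SAM.

```python
import math
import re
from collections import Counter

# Hypothetical, pre-aligned toy family (not real globin sequences).
FAMILY = ["ACDKH", "ACEKH", "ACDRH", "GCDKH"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def consensus_regex(aligned):
    """Strict pattern: each column becomes a class of the residues seen in it."""
    classes = ("[" + "".join(sorted(set(col))) + "]" for col in zip(*aligned))
    return re.compile("".join(classes))


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in zip(*aligned):
        counts = Counter(col)
        total = len(col) + pseudocount * len(ALPHABET)
        profile.append(
            {a: math.log(((counts[a] + pseudocount) / total) / background)
             for a in ALPHABET}
        )
    return profile


def profile_score(profile, seq):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(col[res] for col, res in zip(profile, seq))


pattern = consensus_regex(FAMILY)
profile = build_profile(FAMILY)

# "ACDKY" differs from the family at one position; "WWWWW" is unrelated.
for seq in ("ACEKH", "ACDKY", "WWWWW"):
    matched = pattern.fullmatch(seq) is not None
    print(seq, "regex match:", matched,
          "profile score: %.2f" % profile_score(profile, seq))
```

In this toy case the pattern rejects the near relative outright because of
a single unseen residue, whereas the weighted score still ranks it well
above the unrelated sequence. This is only meant to show why an
all-or-nothing pattern generalises poorly compared with a graded score.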