Iterated sequence databank search methods

Citation
W.R. Taylor and N.P. Brown, Iterated sequence databank search methods, COMPUT CHEM, 23(3-4), 1999, pp. 365-385
Number of citations
35
Subject categories
Chemistry
Journal title
COMPUTERS & CHEMISTRY
ISSN journal
0097-8485
Volume
23
Issue
3-4
Year of publication
1999
Pages
365 - 385
Database
ISI
SICI code
0097-8485(1999)23:3-4<365:ISDSM>2.0.ZU;2-U
Abstract
Iterated sequence databank search methods were assessed from the viewpoint of someone with the sequence of a novel gene product wishing to find distant relatives to their protein and, with the specific searches against the PDB, also hoping to find a relative of known structure. We examined three methods in detail, spanning a range from simple pattern-matching to sophisticated weighted profiles. Rather than apply these methods 'blindly' (with default parameters) to a large number of test queries, we have concentrated on the globins, so allowing a more detailed investigation of each method on different data subsets with different parameter settings. Despite their widespread use, regular-expression matching proved to be very limited, seldom extending beyond the sub-family from which the pattern was derived. To attain any generality, the patterns had to be 'stripped-down' to include only the most highly conserved parts. The QUEST program avoided these problems by introducing a more flexible (weighted) matching. On the PDB sequences this was highly effective, missing only a few globins with probes based on each sub-family or even a single representative from each sub-family. In addition, very few false-positives were encountered, and those that did match often only did so for a few cycles before being lost again. On the larger sequence collection, however, QUEST encountered problems with maintaining (or achieving) the alignment of the full globin family. Psi-BLAST also recognised almost all the globins when matching against the PDB sequences, typically missing three or four of the most distantly related sequences while picking up a few false-positives. In contrast to QUEST, Psi-BLAST performed very well on the larger databank, getting almost a full collection of globins although still retaining the same proportion of false-positives.
SAM applied to the PDB sequences performed reasonably well with the myoglobin and hemoglobin families as probes, typically missing several of the more difficult proteins, but performed poorly with the leghemoglobin probe. Only with the full family range as a probe did it produce results comparable to Psi-BLAST and QUEST. With the larger databank, SAM produced a good result but, again, this was only achieved using the full range of sequence variation with the default regulariser and use of Dirichlet mixtures completely failed in this situation. (C) 1999 Elsevier Science Ltd. All rights reserved.