Iterated sequence databank search methods

Authors

Taylor, WR; Brown, NP

Citation

W.R. Taylor and N.P. Brown, Iterated sequence databank search methods, COMPUT CHEM, 23(3-4), 1999, pp. 365-385

Citations number

Subject categories

Chemistry

Journal title

COMPUTERS & CHEMISTRY

ISSN journal

0097-8485

Volume

23

Issue

3-4

Year of publication

1999

Pages

365 - 385

Database

ISI

SICI code

0097-8485(1999)23:3-4<365:ISDSM>2.0.ZU;2-U

Abstract

Iterated sequence databank search methods were assessed from the viewpoint of someone with the sequence of a novel gene product wishing to find distant relatives to their protein and, with the specific searches against the PDB, also hoping to find a relative of known structure. We examined three methods in detail, spanning a range from simple pattern-matching to sophisticated weighted profiles. Rather than apply these methods 'blindly' (with default parameters) to a large number of test queries, we have concentrated on the globins, so allowing a more detailed investigation of each method on different data subsets with different parameter settings. Despite their widespread use, regular-expression matching proved to be very limited, seldom extending beyond the sub-family from which the pattern was derived. To attain any generality, the patterns had to be 'stripped-down' to include only the most highly conserved parts. The QUEST program avoided these problems by introducing a more flexible (weighted) matching. On the PDB sequences this was highly effective, missing only a few globins with probes based on each sub-family or even a single representative from each sub-family. In addition, very few false-positives were encountered, and those that did match often only did so for a few cycles before being lost again. On the larger sequence collection, however, QUEST encountered problems with maintaining (or achieving) the alignment of the full globin family. Psi-BLAST also recognised almost all the globins when matching against the PDB sequences, typically missing three or four of the most distantly related sequences while picking up a few false-positives. In contrast to QUEST, Psi-BLAST performed very well on the larger databank, getting almost a full collection of globins although still retaining the same proportion of false-positives.
SAM applied to the PDB sequences performed reasonably well with the myoglobin and hemoglobin families as probes, typically missing several of the more difficult proteins, but performed poorly with the leghemoglobin probe. Only with the full family range as a probe did it produce results comparable to Psi-BLAST and QUEST. With the larger databank, SAM produced a good result but, again, this was only achieved using the full range of sequence variation with the default regulariser, and use of Dirichlet mixtures completely failed in this situation. (C) 1999 Elsevier Science Ltd. All rights reserved.