A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases

Authors

Miller, C; Gurd, J; Brass, A

Citation

C. Miller et al., A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases, BIOINFORMAT, 15(2), 1999, pp. 111-121

Citations number

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

13674803

Volume

15

Issue

2

Year of publication

1999

Pages

111 - 121

Database

ISI

SICI code

1367-4803(199902)15:2<111:ARAFSD>2.0.ZU;2-3

Abstract

Motivation: Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments, which are then used to assess the degree of sequence similarity. In this paper we show that by formally separating the word-matching and sequence-alignment processes, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence.

Results: We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects.
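
The idea of a word search that weights matches by word probability can be illustrated with a minimal sketch. This is not the published RAPID algorithm: the abstract does not give its scoring formula, so the negative-log-frequency weight, the `floor` parameter, and the function names `words`, `word_frequencies` and `weighted_score` are all assumptions made for illustration. Only the word size of nine is taken from the abstract.

```python
import math
from collections import Counter

WORD_SIZE = 9  # the abstract reports nine as the optimum word size


def words(seq, k=WORD_SIZE):
    """Yield every overlapping k-letter word in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_frequencies(database, k=WORD_SIZE):
    """Estimate each word's probability from its count across the database."""
    counts = Counter()
    for seq in database:
        counts.update(words(seq, k))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


def weighted_score(query, target, freqs, k=WORD_SIZE, floor=1e-6):
    """Sum an assumed weight, -log(p), over words shared by query and target.

    Rare words contribute more than common ones. `floor` is a hypothetical
    fallback probability for words absent from the frequency table.
    """
    target_words = set(words(target, k))
    score = 0.0
    for w in set(words(query, k)):
        if w in target_words:
            score += -math.log(freqs.get(w, floor))
    return score
```

In this sketch a database entry would be flagged for alignment when its score against a vector sequence exceeds a threshold; the abstract indicates that the authors chose their threshold by ROC analysis, which is not reproduced here.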