A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases

Citation
C. Miller et al., A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases, BIOINFORMAT, 15(2), 1999, pp. 111-121
Number of citations
17
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
13674803
Volume
15
Issue
2
Year of publication
1999
Pages
111 - 121
Database
ISI
SICI code
1367-4803(199902)15:2<111:ARAFSD>2.0.ZU;2-3
Abstract
Motivation: Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments which are then used to assess the degree of sequence similarity. In this paper we show that by formally separating the word-matching and sequence-alignment processes, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence.
Results: We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects.
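The idea of using word probabilities to change how much each matching word counts can be illustrated with a minimal Python sketch. This is not the published RAPID algorithm: the abstract does not give its scoring formula, so the negative-log-frequency weighting below is an assumption, as are the function names words, word_weights and similarity. The word size is left as a parameter; the abstract reports nine as the optimum for finding vector contamination, while the toy example uses three only because its sequences are very short.

```python
import math
from collections import Counter


def words(seq, k):
    """Yield every overlapping word (k-mer) of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_weights(database, k):
    """Weight each word by -log2 of its relative frequency in the database.

    Assumed weighting, chosen only for illustration: rare words get a
    larger weight than common ones.
    """
    counts = Counter(w for seq in database for w in words(seq, k))
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}


def similarity(query, target, weights, k):
    """Sum the weights of the distinct words shared by query and target."""
    shared = set(words(query, k)) & set(words(target, k))
    return sum(weights.get(w, 0.0) for w in shared)


# Toy database; real use would read vector and EST sequences from files.
database = ["ACGTACGTAC", "TTTTTTTTTT", "ACGTTTTTGG"]
weights = word_weights(database, k=3)
hit = similarity("ACGTAC", database[0], weights, k=3)
miss = similarity("ACGTAC", database[1], weights, k=3)
```

In this sketch the word search produces a score with no alignment step at all, which mirrors the separation the abstract describes: an alignment method can be applied afterwards to the high-scoring pairs without changing how they were found.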