C. Miller et al., A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases, BIOINFORMAT, 15(2), 1999, pp. 111-121
Motivation: Word-matching algorithms such as BLAST are routinely used for
sequence comparison. These algorithms typically use areas of matching
words to seed alignments, which are then used to assess the degree of
sequence similarity. In this paper we show that by formally separating the
word-matching and sequence-alignment processes, and using information
about word frequencies to generate alignments and similarity scores, we can
create a new sequence-comparison algorithm which is both fast and
sensitive. The formal split between word searching and alignment allows
users to select an appropriate alignment method without affecting the
underlying similarity search. The algorithm has been used to develop
software for identifying entries in DNA sequence databases which are
contaminated with vector sequence.
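The idea of letting word frequencies shape a similarity score can be illustrated with a small sketch. This is not the published RAPID scoring: the weighting used here (the sum of -log p over shared words, with p estimated from word counts in the database) is an assumption chosen for illustration, and the function names `words`, `word_frequencies` and `weighted_score` are hypothetical.

```python
import math
from collections import Counter


def words(seq, k):
    """Return all overlapping words (k-mers) of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_frequencies(database, k):
    """Estimate each word's probability from its count across the database."""
    counts = Counter(w for seq in database for w in words(seq, k))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def weighted_score(query, target, freqs, k):
    """Sum -log(p) over the distinct words shared by query and target.

    Rare words (small p) contribute more than common ones. This weighting
    is an illustrative assumption, not the scoring defined in the paper.
    """
    shared = set(words(query, k)) & set(words(target, k))
    return sum(-math.log(freqs[w]) for w in shared if w in freqs)
```

With a toy database and a small word size, a match on a rare word outscores a match on a common one, which is the qualitative behaviour the abstract describes. Because the word search only produces scores, any alignment step can be run afterwards on the high-scoring pairs without changing this code.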
Results: We present three algorithms, RAPID, PHAT and SPLAT, which together
allow vector contaminations to be found and assessed extremely rapidly.
RAPID is a word search algorithm which uses probabilities to modify the
significance attached to different words; PHAT and SPLAT are alignment
algorithms. An initial implementation has been shown to be approximately an
order of magnitude faster than BLAST. The formal split between word
searching and alignment not only offers considerable gains in performance,
but also allows alignment generation to be viewed as a user interface
problem, allowing the most useful output method to be selected without
affecting the underlying similarity search. Receiver Operator
Characteristic (ROC) analysis of an artificial test set allows the optimal
score threshold for identifying vector contamination to be determined. ROC
curves were also used to determine the optimum word size (nine) for finding
vector contamination. An analysis of the entire expressed sequence tag
(EST) subset of EMBL found a contamination rate of 0.27%. A more detailed
analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an
error rate of 0.86%, principally due to two large-scale projects.
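The ROC-based choice of a score threshold can be sketched as follows. The abstract does not say which criterion was used to pick the optimum, so this example assumes one common choice, maximising the true-positive rate minus the false-positive rate (Youden's J). The function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cutoff.

    A sequence is called contaminated when its score >= threshold.
    labels holds 1 for known contaminated test sequences and 0 otherwise;
    both classes must be present in the test set.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, fp / neg, tp / pos))
    return points


def best_threshold(scores, labels):
    """Pick the threshold maximising tpr - fpr (an assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]
```

The same procedure could be repeated for each candidate word size, comparing the resulting curves to choose between them.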