Motivation: Nucleotide sequence databases are invaluable tools for both the
private and the academic research communities, with uses ranging from
sequence retrieval to homology searching. The databases face several
data-quality issues, such as sequencing artifacts and errors. We
investigated a major source of these errors: the presence of
vector-contaminated sequences.
Results: Using a panel of 180 vector polylinker sequences, we found 3029
vector-matching sequences (0.36%) in GenBank Release 95-96, with an average
vector-matching length of 72 nucleotides. The number of vector-contaminated
sequences has grown with the database; however, the percentage of
contamination has remained approximately constant, averaging 0.28% from
1982 to 1996.