Establishing a method of vector contamination identification in database sequences

Citation
Ga. Seluja et al., Establishing a method of vector contamination identification in database sequences, BIOINFORMAT, 15(2), 1999, pp. 106-110
Number of citations
18
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
13674803
Volume
15
Issue
2
Year of publication
1999
Pages
106 - 110
Database
ISI
SICI code
1367-4803(199902)15:2<106:EAMOVC>2.0.ZU;2-9
Abstract
Motivation: The nucleotide sequence databases are invaluable tools both for the private and the academic research communities, from the retrieval of sequences to homology searching. Several issues related to data quality, such as the existence of sequencing artifacts and errors, are facing the databases. We investigated a major source of these errors, i.e. the presence of vector-contaminated sequences.
Results: Using a panel of 180 vector polylinker sequences, we found 0.36% or 3029 vector-matching sequences in GenBank Release 95-96, with an average vector-matching length of 72 nucleotides. The number of vector-contaminated sequences has been growing with the database; however, the percent contamination has remained approximately constant at an average of 0.28% from 1982 to 1996.
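
As a rough consistency check on the figures quoted in the Results, the count (3029) and the percentage (0.36%) together imply the approximate number of sequences that were screened. The sketch below is only that arithmetic; the helper name `implied_total` is hypothetical, and the total it produces is an inference from the rounded percentage, not a number stated in the abstract.

```python
def implied_total(matching: int, percent: float) -> int:
    """Back out the total number of sequences from a count and the
    percentage of the total that the count represents."""
    return round(matching / (percent / 100.0))


# Both values are taken from the abstract.
vector_matching = 3029    # vector-matching sequences found
percent_matching = 0.36   # their share of the screened sequences, in percent

total = implied_total(vector_matching, percent_matching)
print(total)  # 841389

# The percentage is reported to two decimals, so the true share could lie
# anywhere between 0.355% and 0.365%; this gives a plausible range.
low = implied_total(vector_matching, 0.365)
high = implied_total(vector_matching, 0.355)
print(low, high)
```

Because 0.36% is a rounded value, the result is best read as "roughly 830,000 to 853,000 sequences" rather than as an exact database size.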