Establishing a method of vector contamination identification in database sequences

Authors

Seluja, GA; Farmer, A; McLeod, M; Harger, C; Schad, PA

Citation

G.A. Seluja et al., Establishing a method of vector contamination identification in database sequences, Bioinformatics, 15(2), 1999, pp. 106-110

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

15

Issue

2

Year of publication

1999

Pages

106 - 110

Database

ISI

SICI code

1367-4803(199902)15:2<106:EAMOVC>2.0.ZU;2-9

Abstract

Motivation: The nucleotide sequence databases are invaluable tools both for the private and the academic research communities, from the retrieval of sequences to homology searching. Several issues related to data quality, such as the existence of sequencing artifacts and errors, are facing the databases. We investigated a major source of these errors, i.e. the presence of vector-contaminated sequences.

Results: Using a panel of 180 vector polylinker sequences, we found 0.36% or 3029 vector-matching sequences in GenBank Release 95-96, with an average vector-matching length of 72 nucleotides. The number of vector-contaminated sequences has been growing with the database; however, the percent contamination has remained approximately constant at an average of 0.28% from 1982 to 1996.