A QUALITY-CONTROL ALGORITHM FOR DNA-SEQUENCING PROJECTS

Citation
O. White et al., A QUALITY-CONTROL ALGORITHM FOR DNA-SEQUENCING PROJECTS, Nucleic acids research, 21(16), 1993, pp. 3829-3838
Number of citations
25
Subject categories
Biology
Journal title
Nucleic acids research
ISSN journal
03051048
Volume
21
Issue
16
Year of publication
1993
Pages
3829 - 3838
Database
ISI
SICI code
0305-1048(1993)21:16<3829:AQAFDP>2.0.ZU;2-E
Abstract
Heterologous DNA sequences from rearrangements with the genomes of host cells, genomic fragments from hybrid cells, or impure tissue sources can threaten the purity of libraries that are derived from RNA or DNA. Hybridization methods can only detect contaminants from known or suspected heterologous sources, and whole library screening is technically very difficult. Detection of contaminating heterologous clones by sequence alignment is only possible when related sequences are present in a known database. We have developed a statistical test to identify heterologous sequences that is based on the differences in hexamer composition of DNA from different organisms. This test does not require that sequences similar to potential heterologous contaminants are present in the database, and can in principle detect contamination by previously unknown organisms. We have applied this test to the major public expressed sequence tag (EST) data sets to evaluate its utility as a quality control measure and a peer evaluation tool. There is detectable heterogeneity in most human and C. elegans EST data sets but it is not apparently associated with cross-species contamination. However, there is direct evidence for both yeast and bacterial sequence contamination in some public database sequences annotated as human. Results obtained with the hexamer test have been confirmed with similarity searches using sequences from the relevant data sets.
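The abstract does not give the form of the test statistic, so the following Python sketch only illustrates the general idea of scoring a sequence by its hexamer composition against a reference set from the expected organism. The function names (`hexamer_counts`, `host_typicality`), the add-one pseudocount, and the comparison against a uniform background are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log

BASES = set("ACGT")


def hexamer_counts(seqs, k=6):
    """Count overlapping k-mers (hexamers by default) in a list of sequences.

    Words containing anything other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= BASES:
                counts[word] += 1
    return counts


def host_typicality(query, host_counts, k=6, pseudo=1.0):
    """Mean log-ratio of host hexamer frequency to a uniform background.

    Hypothetical score, not the paper's statistic: higher values mean the
    query's hexamers are common in the host reference; values near or below
    zero mean the query looks atypical and may deserve a closer look.
    """
    n_words = 4 ** k
    host_total = sum(host_counts.values()) + pseudo * n_words
    query = query.upper()
    total, n = 0.0, 0
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        if not set(word) <= BASES:
            continue
        p_host = (host_counts[word] + pseudo) / host_total
        total += log(p_host * n_words)  # ratio to uniform 1 / 4**k
        n += 1
    return total / n if n else 0.0


if __name__ == "__main__":
    # Toy reference: an AT-rich "host" library (made-up data).
    host = hexamer_counts(["ATATATATATATATAT" * 4])
    print(host_typicality("ATATATATATAT", host))  # host-like query
    print(host_typicality("GCGCGCGCGCGC", host))  # atypical query
```

Because the score uses only a reference built from the expected organism, it does not need sequences from the contaminant itself, which matches the property the abstract emphasizes. A real application would need a large reference set and a calibrated significance threshold, neither of which is specified here.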