Heterologous DNA sequences from rearrangements with the genomes of
host cells, genomic fragments from hybrid cells, or impure tissue
sources can threaten the purity of libraries that are derived from
RNA or DNA. Hybridization methods can only detect contaminants from
known or suspected heterologous sources, and whole library screening
is technically very difficult. Detection of contaminating
heterologous clones by sequence alignment is only possible when
related sequences are present in a known database. We have developed
a statistical test to identify heterologous sequences that is based
on the differences in hexamer composition of DNA from different
organisms. This test does not require that sequences similar to
potential heterologous contaminants are present in the database, and
can in principle detect contamination by previously unknown
organisms. We have applied this test to the major public expressed
sequence tag (EST) data sets to evaluate its utility as a quality
control measure and a peer evaluation tool. There is detectable
heterogeneity in most human and C. elegans EST data sets but it is
not apparently associated with cross-species contamination. However,
there is direct evidence for both yeast and bacterial sequence
contamination in some public database sequences annotated as human.
Results obtained with the hexamer test have been confirmed with
similarity searches using sequences from the relevant data sets.
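
The idea of comparing hexamer composition can be illustrated with a
minimal sketch. This is not the statistical test described above,
whose details are not given here. It assumes a simple log-likelihood
ratio between two hexamer frequency tables, one built from reference
sequences of the expected organism and one from a comparison
organism. The function names, the pseudocount, and the toy sequences
are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 6               # hexamer length, as in the text
ALPHABET = "ACGT"   # words containing other symbols (e.g. N) are skipped


def hexamer_counts(seqs):
    """Count overlapping hexamers across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - K + 1):
            word = seq[i:i + K]
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def log_frequencies(counts, pseudocount=1.0):
    """Turn counts into smoothed log frequencies over all 4**K hexamers.

    The additive pseudocount is an assumption of this sketch; it keeps
    unseen hexamers from producing log(0).
    """
    total = sum(counts.values()) + pseudocount * len(ALPHABET) ** K
    table = {}
    for letters in product(ALPHABET, repeat=K):
        word = "".join(letters)
        table[word] = math.log((counts[word] + pseudocount) / total)
    return table


def log_likelihood_ratio(seq, expected_logp, other_logp):
    """Sum of per-hexamer log ratios for one query sequence.

    Positive scores favour the expected organism, negative scores the
    comparison organism. Overlapping hexamers are not independent, so
    this score is a heuristic, not a calibrated test statistic.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - K + 1):
        word = seq[i:i + K]
        if word in expected_logp:
            score += expected_logp[word] - other_logp[word]
    return score


# Toy reference sets (made up): one AT-rich, one GC-rich.
expected_table = log_frequencies(hexamer_counts(["AT" * 50]))
other_table = log_frequencies(hexamer_counts(["GC" * 30]))

native_like = log_likelihood_ratio("ATATATATATAT", expected_table, other_table)
foreign_like = log_likelihood_ratio("GCGCGCGCGCGC", expected_table, other_table)
```

With these toy tables, `native_like` is positive and `foreign_like`
is negative. In practice the reference tables would be built from
large sequence collections, and a score threshold would have to be
calibrated on sequences of known origin.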