Motivation: Whole genome shotgun sequencing strategies generate sequence data prior to the application of assembly methodologies that result in contiguous sequence. Sequence reads can be employed to indicate regions of conservation between closely related species for which only one genome has been assembled. Consequently, by using pairwise sequence alignment methods it is possible to identify novel, non-repetitive, conserved segments in non-coding sequence that exist between the assembled human genome and mouse whole genome shotgun sequencing fragments. Conserved non-coding regions identify potentially functional DNA that could be involved in transcriptional regulation.
Results: Local sequence alignment methods were applied to mouse fragments and the assembled human genome. In addition, transcription factor binding sites were detected by aligning their corresponding positional weight matrices to the sequence regions. These methods were applied to a set of transcripts corresponding to 502 genes associated with a variety of different human diseases taken from the Online Mendelian Inheritance in Man database. Using statistical arguments we have shown that conserved non-coding segments are enriched in transcription factor binding sites when compared to the sequence background in which the conserved segments are located.
This enrichment of binding sites was not observed in coding sequence. Conserved non-coding segments are not extensively repeated in the genome and therefore their identification provides a rapid means of finding genes with related conserved regions, and consequently potentially related regulatory mechanisms. Conserved segments in upstream regions are found to contain binding sites that are co-localized in a manner consistent with experimentally known transcription factor pairwise co-occurrences and afford the identification of novel co-occurring transcription factor (TF) pairs. This study provides a methodology and further evidence that conserved non-coding regions are biologically significant, since they contain a statistical enrichment of regulatory signals and pairs of signals that enable the construction of regulatory models for human genes.
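
The binding-site detection step summarized above, aligning a positional weight matrix to a sequence region, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the study: the count matrix COUNTS, the uniform background frequency, the pseudocount, the score threshold, and the function names log_odds_matrix, scan and hit_density. The scoring scheme and statistics actually used in the work may differ.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
# Illustrative values only; not data from the study.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert base counts into log2 odds scores against a uniform background (assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def hit_density(sequence, pwm, threshold):
    """Predicted sites per base, a crude quantity for comparing two regions."""
    return len(scan(sequence, pwm, threshold)) / max(len(sequence), 1)
```

Under these assumptions, comparing hit_density for a conserved segment with hit_density for its flanking background gives a rough picture of enrichment. A formal statistical test would be needed to support a claim of the kind made in the abstract.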