A. Vanet et al., Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals, J. Mol. Biol., 297(2), 2000, pp. 335-353
Helicobacter pylori is adapted to life in a unique niche, the gastric
epithelium of primates. Its promoters may therefore be different from those
of other bacteria. Here, we determine motifs possibly involved in the
recognition of such promoter sequences by the RNA polymerase using a new
motif identification method. An important feature of this method is that
the motifs are sought with the fewest possible assumptions about what they
may look like.
The method starts by considering the whole genome of H. pylori and attempts
to infer directly from it a description for a family of promoters. Thus,
this approach differs from searching for such promoters with a previously
established description. Its two algorithms are based on the idea of
inferring motifs by flexibly comparing words in the sequences with an
external object, instead of with one another. The first algorithm infers
single motifs, the second a combination of two motifs separated from one
another by strictly defined, sterically constrained distances. Besides
independently finding motifs known to be present in other bacteria, such as
the Shine-Dalgarno sequence and the TATA-box, this approach suggests the
existence in H. pylori of a new, combined motif, TTAAGC, followed optimally
21 bp downstream by TATAAT. Between these two motifs, there is in some
cases another, TTTTAA or, less frequently, a repetition of TTAAGC separated
optimally from the TATA-box by 12 bp. The combined motif
TTAAGC x (21 +/- 2)TATAAT is present with no errors immediately upstream
from the only two copies of the ribosomal 23S-5S RNA genes in H. pylori,
and with one error upstream from the only two copies of the ribosomal 16S
RNA genes. The operons of both ribosomal RNA molecules are strongly
expressed, representing an encouraging sign of the pertinence of the motifs
found by the algorithms. In 25 cases out of a possible 30, the combined
motif is found with no more than three substitutions immediately upstream
from ribosomal protein genes, or operons containing a ribosomal protein
gene. This is roughly the same frequency of occurrence as for
TTGACA x (15-19)TATAAT (with the same maximum number of substitutions
allowed) described as being the sigma(70) promoter sequence consensus in
Bacillus subtilis and Escherichia coli. The frequency of occurrence of the
new motif obtained, TTAAGC x (19-23)TATAAT, remains high when all protein
genes in H. pylori are considered, as is the case for the
TTGACA x (15-19)TATAAT motif in B. subtilis but not in E. coli.
(C) 2000 Academic Press.
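
To make the notation TTAAGC x (19-23)TATAAT concrete, the following is a minimal Python sketch of scanning a sequence for such a two-box motif with a bounded number of substitutions. It is not the inference algorithm of the paper, which derives motifs from the genome without a prior description; it only illustrates searching for a pattern once it is known. The function names, the reading of the spacer as the number of bases between the end of the first box and the start of the second, and the choice to report only the best-scoring gap per start position are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_combined_motif(seq, box1="TTAAGC", box2="TATAAT",
                        min_gap=19, max_gap=23, max_subs=3):
    """Scan seq for box1, a spacer of min_gap..max_gap bases, then box2.

    Assumption: the spacer counts the bases between the two boxes, and
    max_subs bounds the total substitutions over both boxes combined.
    Returns a list of (start, gap, substitutions), keeping only the
    lowest-substitution gap for each start position.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(box1) + 1):
        subs1 = hamming(seq[start:start + len(box1)], box1)
        if subs1 > max_subs:
            continue
        best = None
        for gap in range(min_gap, max_gap + 1):
            pos2 = start + len(box1) + gap
            window = seq[pos2:pos2 + len(box2)]
            if len(window) < len(box2):
                break  # second box would run past the end of the sequence
            total = subs1 + hamming(window, box2)
            if total <= max_subs and (best is None or total < best[2]):
                best = (start, gap, total)
        if best is not None:
            hits.append(best)
    return hits


# Hypothetical toy sequence: both boxes intact, 21 bases apart.
toy = "GG" + "TTAAGC" + "A" * 21 + "TATAAT" + "CC"
exact_hits = find_combined_motif(toy, max_subs=0)
```

Applied to the region immediately upstream of each gene, a scan of this kind would give occurrence counts comparable in spirit to the "25 cases out of a possible 30" reported above, although the exact upstream window and scoring used by the authors are not stated in the abstract.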