During translational initiation in prokaryotes, the 3' end of the 16S rRNA
binds to a region just upstream of the initiation codon. The relationship
between this Shine-Dalgarno (SD) region and the binding of ribosomes to
translation start-points has been well studied, but a unified mathematical
connection between the SD, the initiation codon and the spacing between
them has been lacking. Using information theory, we constructed a model
that treats these three components uniformly by assigning to the SD and the
initiation region (IR) conservations in bits of information, and by
assigning to the spacing an uncertainty, also in bits. To build the model,
we first aligned the SD region by maximizing the information content there.
The ease of this process confirmed the existence of the SD pattern within a
set of 4122 reviewed and revised Escherichia coli gene starts. This large
data set allowed us to show graphically, by sequence logos, that the
spacing between the SD and the initiation region affects both the SD site
conservation and its pattern. We used the aligned SD, the spacing, and the
initiation region to model ribosome binding and to identify gene starts
that do not conform to the ribosome binding site model. A total of 569
experimentally proven starts are more conserved (have higher information
content) than the full set of revised starts, which probably reflects an
experimental bias against the detection of gene products that have
inefficient ribosome binding sites. Models were refined cyclically by
removing non-conforming weak sites. After this procedure, models derived
from either the original or the revised gene start annotation were similar.
Therefore, this information theory-based technique provides a method for
easily constructing biologically sensible ribosome binding site models.
Such models should be useful for refining gene-start predictions of any
sequenced bacterial genome. (C) 2001 Academic Press.
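
The two quantities named above, conservation in bits for an aligned region
and uncertainty in bits for the spacing, can be illustrated with a minimal
sketch. It is not the procedure used in this work: it assumes equiprobable
bases (a 2-bit maximum per position), omits any small-sample correction, and
uses hypothetical function names and made-up toy data.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a table of observed counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def position_information(aligned_seqs):
    """Conservation in bits at each column of equal-length aligned sequences.

    Assumption: four equiprobable bases, so the maximum is 2 bits per
    position, and no small-sample correction is applied.
    """
    length = len(aligned_seqs[0])
    info = []
    for i in range(length):
        column = Counter(seq[i] for seq in aligned_seqs)
        info.append(2.0 - shannon_entropy(column))
    return info


def spacing_uncertainty(spacings):
    """Uncertainty in bits of the observed distribution of spacings."""
    return shannon_entropy(Counter(spacings))


# Hypothetical toy data, for illustration only (not E. coli measurements).
toy_sd = ["AGGAGG", "AGGAGG", "AGGAGA", "TGGAGG"]
toy_spacings = [7, 7, 8, 9]

if __name__ == "__main__":
    print("bits per SD position:", position_information(toy_sd))
    print("spacing uncertainty (bits):", spacing_uncertainty(toy_spacings))
```

In this sketch a fully conserved column scores 2 bits and a column with all
four bases equally frequent scores 0 bits; summing the per-position values
gives one simple measure of the total conservation of an aligned region.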