C. Delamarche et al., A symbolic-numeric approach to find patterns in genomes. Application to the translation initiation sites of E. coli, BIOCHIMIE, 81(11), 1999, pp. 1065-1072
DNA sequence data provided by genome sequencing programs open new research
prospects. In this respect, computational investigations are of major
importance to discover new 'functional/structural patterns' and to improve
knowledge of biological processes. For example, even though the principal
steps of translation initiation in prokaryotes are known, it is difficult to point out
the exact pattern of the mRNA that is recognized by the ribosome. In this
study, we have carried out a systematic context analysis of the complete
genome of E. coli, around codons in competition for translation initiation.
Using a combinatorial approach, we first show that it is possible to
accurately define the initiation site by looking for the localization of patterns
representing various combinations of trinucleotides. We have combined this
approach with a statistical analysis based on the frequencies of these
patterns. This leads to a decision tree able to discriminate true and false
starts with a recognition level near 90%. Our method may help to precisely
localize the beginning of open reading frames, and point to likely mistakes
for some genes in the database. The method may be included as a component of
a gene recognition system, is not restricted to a particular genome or a
two-class discrimination, and may be applied to a broader class of
biological patterns. (C) Societe francaise de biochimie et biologie
moleculaire/Editions scientifiques et medicales Elsevier SAS.