L. Marsan et Mf. Sagot, Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification, J COMPUT BI, 7(3-4), 2000, pp. 345-362
This paper introduces two exact algorithms for extracting conserved structu
red motifs from a set of DNA sequences. Structured motifs may be described
as an ordered collection of p greater than or equal to 1 "boxes" (each box
corresponding to one part of the structured moth), p substitution rates (on
e for each box) and p - 1 intervals of distance (one for each pair of succe
ssive boxes in the collection). The contents of the boxes - that is, the mo
tifs themselves - are unknown at the start of the algorithm. This is precis
ely what the algorithms are meant to find. A suffix tree is used for findin
g such moths, The algorithms are efficient enough to be able to infer site
consensi, such as, for instance, promoter sequences or regulatory sites, fr
om a set of unaligned sequences corresponding to the noncoding regions upst
ream from all genes of a genome. In particular, both algorithms time comple
xity scales linearly with N(2)n where n is the average length of the sequen
ces and N their number. An application to the identification of promoter an
d regulatory consensus sequences in bacterial genomes is shown.