Assembling genes from predicted exons in linear time with dynamic programming

Authors
Citation
R. Guigo, Assembling genes from predicted exons in linear time with dynamic programming, J COMPUT BI, 5(4), 1998, pp. 681-702
Citations number
26
Categorie Soggetti
Biochemistry & Biophysics
Journal title
JOURNAL OF COMPUTATIONAL BIOLOGY
ISSN journal
10665277 → ACNP
Volume
5
Issue
4
Year of publication
1998
Pages
681 - 702
Database
ISI
SICI code
1066-5277(199824)5:4<681:AGFPEI>2.0.ZU;2-L
Abstract
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in t he query DNA sequence, and candidate genes are assembled from such a pool a s sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring can didate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time p roportional to the square of the number of predicted exons. Here, we presen t an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a g iven exon can be obtained by appending the exon to the highest scoring amon g the highest scoring genes ending at each compatible preceding exon. The a lgorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons si multaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure m odel. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene fe atures are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-stran d predictions and for considering gene features other than coding exons (su ch as promoter elements) in valid gene structures.