In a number of programs for gene structure prediction in higher eukaryotic
genomic sequences, exon prediction is decoupled from gene assembly: a large
pool of candidate exons is predicted and scored from features located in t
he query DNA sequence, and candidate genes are assembled from such a pool a
s sequences of nonoverlapping frame-compatible exons. Genes are scored as a
function of the scores of the assembled exons, and the highest scoring can
didate gene is assumed to be the most likely gene encoded by the query DNA
sequence. Considering additive gene scoring functions, currently available
algorithms to determine such a highest scoring candidate gene run in time p
roportional to the square of the number of predicted exons. Here, we presen
t an algorithm whose running time grows only linearly with the size of the
set of predicted exons. Polynomial algorithms rely on the fact that, while
scanning the set of predicted exons, the highest scoring gene ending in a g
iven exon can be obtained by appending the exon to the highest scoring amon
g the highest scoring genes ending at each compatible preceding exon. The a
lgorithm here relies on the simple fact that such highest scoring gene can
be stored and updated. This requires scanning the set of predicted exons si
multaneously by increasing acceptor and donor position. On the other hand,
the algorithm described here does not assume an underlying gene structure m
odel. Indeed, the definition of valid gene structures is externally defined
in the so-called Gene Model. The Gene Model specifies simply which gene fe
atures are allowed immediately upstream which other gene features in valid
gene structures. This allows for great flexibility in formulating the gene
identification problem. In particular it allows for multiple-gene two-stran
d predictions and for considering gene features other than coding exons (su
ch as promoter elements) in valid gene structures.