We have developed a computer program, GeneParser, which identifies and
determines the fine structure of protein genes in genomic DNA
sequences. The program scores all subintervals in a sequence for content
statistics indicative of introns and exons, and for sites that identify
their boundaries. This information is weighted by a neural network to
approximate the log-likelihood that each subinterval exactly represents
an intron or exon (first, internal or last). A dynamic programming
algorithm is then applied to these data to find the combination of introns
and exons that maximizes the likelihood function. Using this method,
we can rapidly generate ranked suboptimal solutions, each of which is
the optimum solution containing a given intron-exon junction. We have
tested the system on a large collection of human genes. On sequences
not used in training, we achieved a correlation coefficient for exon
nucleotide prediction of 0.89. For a subset of G+C-rich genes, a
correlation coefficient of 0.94 was achieved. We have also quantified the
robustness of the method to substitution and frame-shift errors and
show how the system can be optimized for performance on sequences with
known levels of sequencing errors.
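
The dynamic programming step described above can be illustrated with a minimal sketch. It is not the GeneParser implementation. It assumes that the sequence is tiled completely by alternating segments (first exon, intron, optional internal exons and introns, last exon) and that a caller-supplied function returns a log-likelihood for each candidate interval. The names `best_parse`, `score` and `toy_score` are hypothetical, and the toy scorer merely stands in for the neural-network-weighted scores.

```python
from typing import Callable, List, Optional, Tuple

# (kind, start, end) -> log-likelihood that [start, end) is exactly that kind.
# kind is one of "first", "internal", "last" (exons) or "intron".
Score = Callable[[str, int, int], float]
Segment = Tuple[str, int, int]

NEG_INF = float("-inf")


def best_parse(n: int, score: Score) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring exon/intron tiling of [0, n).

    Assumption of this sketch: the parse starts with a first exon at 0,
    ends with a last exon at n, and contains at least one intron.
    """
    # best_exon[j] / best_intron[j]: best score of a partial parse of [0, j)
    # whose final segment is a (non-last) exon / an intron ending at j.
    best_exon = [NEG_INF] * (n + 1)
    best_intron = [NEG_INF] * (n + 1)
    back_exon: List[Optional[Tuple[str, int]]] = [None] * (n + 1)
    back_intron: List[Optional[int]] = [None] * (n + 1)

    for j in range(1, n):
        # Exon ending at j: either the first exon, or an internal exon
        # that follows the best parse ending in an intron at i.
        s = score("first", 0, j)
        if s > best_exon[j]:
            best_exon[j], back_exon[j] = s, ("first", 0)
        for i in range(1, j):
            if best_intron[i] == NEG_INF:
                continue
            s = best_intron[i] + score("internal", i, j)
            if s > best_exon[j]:
                best_exon[j], back_exon[j] = s, ("internal", i)
        # Intron ending at j: follows the best parse ending in an exon at i.
        for i in range(1, j):
            if best_exon[i] == NEG_INF:
                continue
            s = best_exon[i] + score("intron", i, j)
            if s > best_intron[j]:
                best_intron[j], back_intron[j] = s, i

    # Close the parse with a last exon [i, n) after an intron ending at i.
    total, last_start = NEG_INF, None
    for i in range(1, n):
        if best_intron[i] == NEG_INF:
            continue
        s = best_intron[i] + score("last", i, n)
        if s > total:
            total, last_start = s, i
    if last_start is None:
        raise ValueError("no valid parse for a sequence of this length")

    # Trace back from the last exon to the first exon.
    segments: List[Segment] = [("last", last_start, n)]
    pos = last_start
    while True:
        i = back_intron[pos]
        segments.append(("intron", i, pos))
        pos = i
        kind, i = back_exon[pos]
        segments.append((kind, i, pos))
        pos = i
        if kind == "first":
            break
    segments.reverse()
    return total, segments


# Toy scorer (hypothetical): rewards three "true" segments, penalizes the rest.
TRUTH = {("first", 0, 3), ("intron", 3, 7), ("last", 7, 10)}


def toy_score(kind: str, start: int, end: int) -> float:
    return 1.0 if (kind, start, end) in TRUTH else -1.0
```

With this scorer, `best_parse(10, toy_score)` recovers the three rewarded segments. The sketch evaluates on the order of n^2 candidate intervals for a sequence of length n, and a scorer may return negative infinity to forbid an interval, for example one that lacks a plausible splice site. Generating the ranked suboptimal solutions mentioned above is not covered here.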