IDENTIFICATION OF PROTEIN-CODING REGIONS IN GENOMIC DNA

Citation
E.E. Snyder and G.D. Stormo, IDENTIFICATION OF PROTEIN-CODING REGIONS IN GENOMIC DNA, Journal of Molecular Biology, 248(1), 1995, pp. 1-18
Number of citations
53
Subject Categories
Biology
ISSN journal
00222836
Volume
248
Issue
1
Year of publication
1995
Pages
1 - 18
Database
ISI
SICI code
0022-2836(1995)248:1<1:IOPRIG>2.0.ZU;2-F
Abstract
We have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to this data to find the combination of introns and exons that maximizes the likelihood function. Using this method, we can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. We have tested the system on a large collection of human genes. On sequences not used in training, we achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G + C-rich genes, a correlation coefficient of 0.94 was achieved. We have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
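
The dynamic programming step described in the abstract can be illustrated with a minimal sketch. It is not GeneParser's implementation: the names best_parse and toy_score, the state names, and the simplifying assumption that the whole sequence is tiled as first exon, intron, any number of (internal exon, intron) pairs, and last exon are all hypothetical. In the real program the per-interval scores would come from the neural-network weighting of content and site statistics; here a toy scoring function stands in for them.

```python
from typing import Callable, List, Optional, Tuple

NEG = float("-inf")
Segment = Tuple[str, int, int]  # (kind, start, end), half-open interval [start, end)

def best_parse(n: int, score: Callable[[str, int, int], float]) -> Tuple[float, List[Segment]]:
    """Tile [0, n) with alternating exons and introns so that the summed
    interval scores (stand-ins for log-likelihoods) are maximal.

    Assumed grammar (a simplification): first, intron, (internal, intron)*, last.
    """
    # Which segment kinds may immediately precede each kind.
    preds = {"intron": ("first", "internal"), "internal": ("intron",), "last": ("intron",)}
    kinds = ("first", "intron", "internal", "last")
    # best[k][j]: best score of a parse of [0, j) whose final segment has kind k.
    best = {k: [NEG] * (n + 1) for k in kinds}
    back: dict = {k: [None] * (n + 1) for k in kinds}

    for j in range(1, n + 1):
        # The first exon must start at position 0.
        best["first"][j] = score("first", 0, j)
        back["first"][j] = (None, 0)
        for kind, allowed in preds.items():
            for i in range(1, j):  # candidate segment [i, j)
                s = score(kind, i, j)
                for p in allowed:
                    cand = best[p][i] + s
                    if cand > best[kind][j]:
                        best[kind][j] = cand
                        back[kind][j] = (p, i)

    if best["last"][n] == NEG:
        raise ValueError("no valid parse for this sequence length and scoring")

    # Trace back from the last exon to recover the segments.
    segments: List[Segment] = []
    kind: Optional[str] = "last"
    j = n
    while kind is not None:
        p, i = back[kind][j]
        segments.append((kind, i, j))
        kind, j = p, i
    segments.reverse()
    return best["last"][n], segments

# Toy scoring (an assumption for illustration only): reward three "true"
# segments and penalize every other interval.
TRUTH = {("first", 0, 3), ("intron", 3, 6), ("last", 6, 10)}

def toy_score(kind: str, start: int, end: int) -> float:
    return 5.0 if (kind, start, end) in TRUTH else -1.0

total, segments = best_parse(10, toy_score)
print(total, segments)
```

Because every subinterval is scored once and each is combined with the best parse ending at its start, the sketch runs in time quadratic in the sequence length. Forcing a given junction into the traceback is one way the ranked suboptimal solutions mentioned in the abstract could be obtained, though the sketch does not attempt this.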