FINDING INTRON/EXON SPLICE JUNCTIONS USING INFO, INTERRUPTION FINDER AND ORGANIZER

Authors
Citation
M.T. Laub and D.W. Smith, FINDING INTRON/EXON SPLICE JUNCTIONS USING INFO, INTERRUPTION FINDER AND ORGANIZER, Journal of Computational Biology, 5(2), 1998, pp. 307-321
Number of citations
23
Subject Categories
Mathematics,Biology,"Biochemical Research Methods",Mathematics,"Biotechnology & Applied Microbiology"
ISSN journal
10665277
Volume
5
Issue
2
Year of publication
1998
Pages
307 - 321
Database
ISI
SICI code
1066-5277(1998)5:2<307:FIESJU>2.0.ZU;2-V
Abstract
INFO, INterruption Finder and Organizer, has been used to find coding sequence intron-exon splice junctions in human and other DNA by comparing the six conceptual translations of the input DNA sequence with sequences in protein databanks using a similarity matrix and windowing algorithm. Similarities detected both delineate position of the gene and provide clues as to the function of the gene product. In addition to use of a standard similarity matrix and windowing algorithm, INFO uses two novel steps, the MiniLibrary and Reverse Sequence steps, to enhance identification of small exons and to improve precision of junction nucleotide delineation. Exons as small as about 30 bases can be reliably found, and >90% of junctions are precisely identified when canonical splice junction information is used. With the MiniLibrary and Reverse Sequence steps, INFO parameters need not be optimized by the user. In comparative test runs using 19 human DNA sequences, INFO found 108 of 111 exons, with 0 reported false positives, compared with 111 exons and 51 false positives for BLASTX, 99 exons and 6 false positives for GRAIL II, 77 exons and 24 false positives for GeneMark, 61 exons and 9 false positives for GeneID, and 105 exons and 6 false positives for PROCRUSTES. The correlation coefficient for finding and positioning these 111 exons was greater than 98% for INFO. Comparable results were obtained in test runs of 13 nonhuman DNA sequences. INFO is applicable to DNA from any species, will become more robust as sequence databanks expand, and complements other heuristic approaches.
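The "six conceptual translations" mentioned in the abstract are the three reading frames of the input strand plus the three of its reverse complement. The sketch below illustrates only that first step, assuming the standard genetic code. It is not INFO's implementation: the function names and the frame labels (`+1` to `-3`) are hypothetical, and the similarity-matrix, windowing, MiniLibrary and Reverse Sequence steps are not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard code is used; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map hypothetical frame labels '+1'..'+3', '-1'..'-3' to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six protein strings would then be compared against databank sequences; a trailing partial codon in any frame is simply dropped in this sketch.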