
AUTOMATED GENE IDENTIFICATION IN LARGE-SCALE GENOMIC SEQUENCES

Authors

XU Y; UBERBACHER EC

Citation

Y. Xu and E.C. Uberbacher, AUTOMATED GENE IDENTIFICATION IN LARGE-SCALE GENOMIC SEQUENCES, Journal of computational biology, 4(3), 1997, pp. 325-338

Citations number

Subject Categories

Mathematical Methods, Biology & Medicine; Mathematics; Biology; Biochemical Research Methods; Biotechnology & Applied Microbiology

Journal title

Journal of computational biology

ISSN journal

1066-5277

Volume

4

Issue

3

Year of publication

1997

Pages

325 - 338

Database

ISI

SICI code

1066-5277(1997)4:3<325:AGIILG>2.0.ZU;2-V

Abstract

Computational methods for gene identification in genomic sequences typically have two phases: coding region recognition and gene parsing. While there are a number of effective methods for recognizing coding regions (exons), parsing the recognized exons into proper gene structures, to a large extent, remains an unsolved problem. We have developed a computer program which can automatically parse the recognized exons into gene models that are most consistent with the available Expressed Sequence Tags (ESTs) and a set of biological heuristics, derived empirically. The gene modeling algorithm used in this program provides a general framework for applying EST information so the modeling accuracy improves as the amount of available EST information increases. Based on preliminary tests on a number of large DNA sequences, using the dbEST database, we have observed that the algorithm can (1) accurately model complicated multiple gene structures, including embedded genes, (2) identify falsely-recognized exons and locate missed exons by the initial exon recognition phase, and (3) make more accurate exon boundary predictions, if the necessary EST information is available. We have extended this EST-based gene modeling algorithm to model genes on unfinished DNA contigs at the end of the shotgun sequencing. This extended version can automatically determine the orientations and the relative order of the DNA contigs (with gaps between them) using the available ESTs as reference models, before the gene modeling phase.
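
The idea of using ESTs to decide which recognized exons belong to the same gene can be illustrated with a toy sketch. It is not the algorithm of the paper, which this record does not describe in detail. It only assumes that exons matched by the same EST belong to one gene, and groups them with a union-find structure. The function name `group_exons_by_est`, the input layout, and the sample coordinates are all hypothetical.

```python
from collections import defaultdict


def group_exons_by_est(exons, est_hits):
    """Group predicted exons into candidate genes (illustrative only).

    exons:    list of (start, end) tuples for the recognized exons.
    est_hits: dict mapping an EST id to the list of exon indices it matches.
    Assumption: exons matched by the same EST belong to the same gene.
    """
    parent = list(range(len(exons)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Every EST links all of the exons it matches into one group.
    for hit in est_hits.values():
        for other in hit[1:]:
            union(hit[0], other)

    groups = defaultdict(list)
    for idx in range(len(exons)):
        groups[find(idx)].append(idx)

    # Order exons within each gene by start, and genes by their first exon.
    genes = [sorted(members, key=lambda i: exons[i][0])
             for members in groups.values()]
    return sorted(genes, key=lambda g: exons[g[0]][0])


# Hypothetical input: one gene embedded in an intron of another.
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
est_hits = {"est_a": [0, 3], "est_b": [1, 2]}
genes = group_exons_by_est(exons, est_hits)
print(genes)
```

In this made-up input, exons 0 and 3 are joined by one EST and exons 1 and 2 by another, so the second gene sits inside the first, which is the embedded-gene situation the abstract mentions. Exons with no EST support remain singletons here. A real gene modeler would also need the biological heuristics and boundary corrections the abstract refers to.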