Computational methods for gene identification in genomic sequences
typically have two phases: coding region recognition and gene parsing.
While there are a number of effective methods for recognizing coding
regions (exons), parsing the recognized exons into proper gene structures,
to a large extent, remains an unsolved problem. We have developed a
computer program that can automatically parse the recognized exons into
gene models that are most consistent with the available Expressed
Sequence Tags (ESTs) and a set of empirically derived biological
heuristics. The gene modeling algorithm used in this program provides a
general framework for applying EST information, so the modeling accuracy
improves as the amount of available EST information increases. Based on
preliminary tests on a number of large DNA sequences, using the dbEST
database, we have observed that the algorithm can (1) accurately model
complicated multiple gene structures, including embedded genes, (2)
identify falsely recognized exons and locate exons missed by the initial
exon recognition phase, and (3) make more accurate exon boundary
predictions, if the necessary EST information is available. We have
extended this EST-based gene modeling algorithm to model genes on
unfinished DNA contigs at the end of shotgun sequencing. This extended
version can automatically determine the orientations and the relative
order of the DNA contigs (with gaps between them) using the available
ESTs as reference models, before the gene modeling phase.
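
The contig ordering step can be illustrated with a minimal sketch. This is not the program's actual algorithm, which the abstract does not specify; it only shows the general idea of using an EST as a reference model. It assumes each EST-to-contig alignment has already been reduced to a hypothetical `EstHit` record (the name and its fields are illustrative), and that an EST matching more than one contig implies those contigs follow each other in the order of their matched EST positions, with the hit strand giving each contig's orientation.

```python
from typing import NamedTuple


class EstHit(NamedTuple):
    """Hypothetical summary of one EST-to-contig alignment."""
    est_id: str
    contig_id: str
    est_start: int  # first EST position covered by the match
    est_end: int    # last EST position covered by the match
    strand: str     # '+' if the contig matches the EST as given, '-' if reverse-complemented


def order_and_orient(hits):
    """Map each EST that links several contigs to an ordered list of (contig_id, strand)."""
    by_est = {}
    for hit in hits:
        by_est.setdefault(hit.est_id, []).append(hit)

    layout = {}
    for est_id, est_hits in by_est.items():
        # Contigs matching earlier EST positions are placed first.
        est_hits.sort(key=lambda h: h.est_start)
        chain = []
        for hit in est_hits:
            # Collapse consecutive matches (e.g. several exons) on the same contig.
            if not chain or chain[-1][0] != hit.contig_id:
                chain.append((hit.contig_id, hit.strand))
        # Only ESTs that span at least two contigs carry ordering information.
        if len(chain) > 1:
            layout[est_id] = chain
    return layout


example_hits = [
    EstHit("est1", "c2", 300, 520, "-"),
    EstHit("est1", "c1", 1, 290, "+"),
    EstHit("est2", "c3", 1, 400, "+"),
]
print(order_and_orient(example_hits))
```

In this toy input, `est1` places contig `c1` before `c2` and indicates that `c2` should be reverse-complemented, while `est2` touches a single contig and therefore contributes no ordering constraint. A real implementation would also need to reconcile conflicting evidence from different ESTs, which this sketch does not attempt.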