An assessment of gene prediction accuracy in large DNA sequences

Authors

Guigo, R; Agarwal, P; Abril, JF; Burset, M; Fickett, JW

Citation

R. Guigo et al., An assessment of gene prediction accuracy in large DNA sequences, GENOME RES, 10(10), 2000, pp. 1631-1642

Citations number

Subject categories

Molecular Biology & Genetics

Journal title

GENOME RESEARCH

ISSN journal

1088-9051

Volume

10

Issue

10

Year of publication

2000

Pages

1631 - 1642

Database

ISI

SICI code

1088-9051(200010)10:10<1631:AAOGPA>2.0.ZU;2-G

Abstract

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the ~200 kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.