An assessment of gene prediction accuracy in large DNA sequences

Citation
R. Guigo et al., An assessment of gene prediction accuracy in large DNA sequences, GENOME RES, 10(10), 2000, pp. 1631-1642
Number of citations
25
Subject categories
Molecular Biology & Genetics
Journal title
GENOME RESEARCH
ISSN journal
1088-9051
Volume
10
Issue
10
Year of publication
2000
Pages
1631 - 1642
Database
ISI
SICI code
1088-9051(200010)10:10<1631:AAOGPA>2.0.ZU;2-G
Abstract
One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the ~200 kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments.
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
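
Illustrative sketch
The abstract describes embedding single-gene sequences in randomly generated intergenic sequence and then measuring sensitivity and specificity. The Python sketch below is one minimal way such a test set and a nucleotide-level evaluation could look; it is not the authors' pipeline. The function names, the uniform base composition of the spacers, the fixed spacer length, and the definitions used here (sensitivity = correctly predicted coding bases / true coding bases; specificity = correctly predicted coding bases / predicted coding bases, as is customary in gene-prediction evaluation) are all assumptions made for illustration.

```python
import random


def random_intergenic(length, rng):
    """Random DNA spacer; uniform base composition is an assumption."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_semiartificial(genes, spacer_len, rng):
    """Concatenate single-gene sequences separated by random spacers.

    genes: list of (sequence, exons), where exons is a list of
    (start, end) pairs, 0-based and half-open, relative to the sequence.
    Returns the long sequence and the exon coordinates shifted into it.
    """
    parts, exons, offset = [], [], 0
    for seq, gene_exons in genes:
        spacer = random_intergenic(spacer_len, rng)
        parts.append(spacer)
        offset += len(spacer)
        # Shift this gene's exons by everything placed before it.
        exons.extend((s + offset, e + offset) for s, e in gene_exons)
        parts.append(seq)
        offset += len(seq)
    parts.append(random_intergenic(spacer_len, rng))  # trailing spacer
    return "".join(parts), exons


def coding_positions(exons):
    """Set of all base positions covered by the given exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    true_pos = coding_positions(true_exons)
    pred_pos = coding_positions(predicted_exons)
    tp = len(true_pos & pred_pos)
    sensitivity = tp / len(true_pos) if true_pos else 0.0
    specificity = tp / len(pred_pos) if pred_pos else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "genes": placeholder sequences with made-up exon coordinates.
    genes = [("A" * 10, [(2, 5)]), ("C" * 8, [(0, 4)])]
    sequence, true_exons = build_semiartificial(genes, 5, rng)
    # A hypothetical prediction: one exon found, one false exon in a spacer.
    predicted = [(true_exons[0][0], true_exons[0][1]), (30, 33)]
    print(len(sequence), true_exons, nucleotide_accuracy(true_exons, predicted))
```

A false exon placed in a spacer lowers specificity without changing sensitivity, which matches the pattern the abstract reports for ab initio prediction on the longer sequences.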