Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences

Citation
N. Pavy et al., Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences, BIOINFORMAT, 15(11), 1999, pp. 887-899
Number of citations
34
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
15
Issue
11
Year of publication
1999
Pages
887 - 899
Database
ISI
SICI code
1367-4803(199911)15:11<887:EOGPSU>2.0.ZU;2-Z
Abstract
Motivation: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes.
Results: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for the Arabidopsis genomic sequences. Gene modeling could be further improved by combination of prediction software.
Availability: The AraSet sequence set, the Perl programs and complementary results and notes are available at http://sphinx.rug.ac.be:8080/biocomp/napav/.
Contact: Pierre.Rouze@gengenp.rug.ac.be.
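Note: The conventional exon-level metrics mentioned in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in the gene prediction literature, where sensitivity is the fraction of annotated exons predicted exactly and specificity is the fraction of predicted exons that are exactly correct. The function name, the coordinate convention and the sample coordinates are hypothetical and are not taken from the paper or from its Perl programs; the new protein-level and gene-model measures are not reproduced here.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: an exon is (start, end), inclusive, on one contig.
Exon = Tuple[int, int]


def exon_level_accuracy(
    annotated: Iterable[Exon], predicted: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if both boundaries match exactly.
    Sensitivity = correct exons / annotated exons.
    Specificity = correct exons / predicted exons (the usual gene-finding
    convention, assumed here rather than taken from the paper).
    """
    ann: Set[Exon] = set(annotated)
    pred: Set[Exon] = set(predicted)
    exact = len(ann & pred)  # exons with both boundaries predicted correctly
    sensitivity = exact / len(ann) if ann else 0.0
    specificity = exact / len(pred) if pred else 0.0
    return sensitivity, specificity


# Made-up coordinates, for illustration only.
annotated = [(100, 200), (300, 400), (500, 650)]
predicted = [(100, 200), (300, 420), (500, 650), (800, 900)]
sn, sp = exon_level_accuracy(annotated, predicted)
print(f"exon sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this toy case two of the three annotated exons are recovered exactly, and two of the four predicted exons are correct; the exon with a shifted 3' boundary counts as wrong on both measures.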