N. Pavy et al., Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences, BIOINFORMAT, 15(11), 1999, pp. 887-899
Motivation: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes.
Results: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for the Arabidopsis genomic sequences. Gene modeling could be further improved by combination of prediction software.
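As an illustration of the conventional exon-level metrics mentioned above, here is a minimal Python sketch. It assumes the definitions commonly used in the gene prediction literature: an exon counts as correct only when both of its boundaries match the annotation exactly, sensitivity is the fraction of annotated exons that were predicted correctly, and specificity is the fraction of predicted exons that are correct. The abstract does not spell out its formulas, so these definitions, the function name `exon_level_accuracy`, and the coordinates are assumptions made for illustration and are not taken from the AraSet Perl programs.

```python
from typing import Iterable, Tuple

# Hypothetical representation: an exon is a (start, end) pair of coordinates.
Exon = Tuple[int, int]


def exon_level_accuracy(
    annotated: Iterable[Exon], predicted: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    Assumed convention: a predicted exon is a true positive only if
    both its start and end match an annotated exon exactly.
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    true_positives = len(annotated_set & predicted_set)

    # Sensitivity: share of annotated exons recovered exactly.
    sensitivity = true_positives / len(annotated_set) if annotated_set else 0.0
    # Specificity (as used in this field): share of predictions that are exact.
    specificity = true_positives / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Made-up coordinates, for illustration only.
annotated_exons = [(100, 200), (300, 450), (600, 720)]
predicted_exons = [(100, 200), (300, 440), (600, 720), (900, 950)]

sn, sp = exon_level_accuracy(annotated_exons, predicted_exons)
print(f"exon sensitivity={sn:.2f}, specificity={sp:.2f}")
# prints: exon sensitivity=0.67, specificity=0.50
```

In this toy case two of the three annotated exons are predicted exactly, one prediction misses an exon end by ten bases, and one prediction has no annotated counterpart, so sensitivity is 2/3 and specificity is 2/4.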
Availability: The AraSet sequence set, the Perl programs and complementary results and notes are available at http://sphinx.rug.ac.be:8080/biocomp/napav/.
Contact: Pierre.Rouze@gengenp.rug.ac.be.