Genome annotation assessment in Drosophila melanogaster

Citation
M.G. Reese et al., Genome annotation assessment in Drosophila melanogaster, GENOME RES, 10(4), 2000, pp. 483-501
Number of citations
53
Subject categories
Molecular Biology & Genetics
Journal title
GENOME RESEARCH
Journal ISSN
1088-9051
Volume
10
Issue
4
Year of publication
2000
Pages
483 - 501
Database
ISI
SICI code
1088-9051(200004)10:4<483:GAAIDM>2.0.ZU;2-7
Abstract
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
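
Illustrative note: the abstract reports the share of coding nucleotides correctly identified but does not spell out how that figure is computed. The sketch below shows one common way such base-level scores are calculated in gene-finder evaluations. It is not taken from the paper. It assumes inclusive (start, end) exon coordinates and the gene-finding convention in which "specificity" means the fraction of predicted coding bases that are truly coding. The function names and the toy coordinates are hypothetical.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) intervals into a set of base positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def base_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = correctly predicted coding bases / reference coding bases
    specificity = correctly predicted coding bases / predicted coding bases
    (assumed convention; the abstract does not define these measures)
    """
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Hypothetical toy coordinates, not data from the Adh region.
reference_exons = [(100, 199), (300, 399)]
predicted_exons = [(100, 199), (310, 419)]

sn, sp = base_level_accuracy(predicted_exons, reference_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Expanding intervals into position sets keeps the sketch short, but for a contig spanning megabases an interval-sweep comparison would use far less memory.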