Automated genome sequence analysis and annotation

Authors

Andrade, MA; Brown, NP; Leroy, C; Hoersch, S; de Daruvar, A; Reich, C; Franchini, A; Tamames, J; Valencia, A; Ouzounis, C; Sander, C

Citation

M.A. Andrade et al., Automated genome sequence analysis and annotation, Bioinformatics, 15(5), 1999, pp. 391-412

Citations number

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

Issue

Year of publication

1999

Pages

391 - 412

Database

ISI

SICI code

1367-4803(199905)15:5<391:AGSAAA>2.0.ZU;2-M

Abstract

Motivation: Large-scale genome projects generate a rapidly increasing number of sequences, most of them biochemically uncharacterized. Research in bioinformatics contributes to the development of methods for the computational characterization of these sequences. However, the installation and application of these methods require experience and are time-consuming.
Results: We present here an automatic system for preliminary functional annotation of protein sequences that has been applied to the analysis of sets of sequences from complete genomes, both to refine overall performance and to make new discoveries comparable to those made by human experts. The GeneQuiz system includes a Web-based browser that allows examination of the evidence leading to an automatic annotation and offers additional information, views of the results, and links to biological databases that complement the automatic analysis. System structure and operating principles concerning the use of multiple sequence databases, underlying sequence analysis tools, lexical analyses of database annotations and decision criteria for functional assignments are detailed. The system makes automatic quality assessments of results based on prior experience with the underlying sequence analysis tools; overall error rates in functional assignment are estimated at 2.5-5% for cases annotated with highest reliability ('clear' cases). Sources of overinterpretation of results are discussed with proposals for improvement. A conservative definition for reporting 'new findings' that takes account of database maturity is presented along with examples of possible kinds of discoveries (new function, family and superfamily) made by the system. System performance in relation to sequence database coverage, database dynamics and database search methods is analysed, demonstrating the inherent advantages of an integrated automatic approach using multiple databases and search methods applied in an objective and repeatable manner.