Automated genome sequence analysis and annotation

Citation
Ma. Andrade et al., Automated genome sequence analysis and annotation, BIOINFORMAT, 15(5), 1999, pp. 391-412
Number of citations
79
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
15
Issue
5
Year of publication
1999
Pages
391 - 412
Database
ISI
SICI code
1367-4803(199905)15:5<391:AGSAAA>2.0.ZU;2-M
Abstract
Motivation: Large-scale genome projects generate a rapidly increasing number of sequences, most of them biochemically uncharacterized. Research in bioinformatics contributes to the development of methods for the computational characterization of these sequences. However, the installation and application of these methods require experience and are time consuming.
Results: We present here an automatic system for preliminary functional annotation of protein sequences that has been applied to the analysis of sets of sequences from complete genomes, both to refine overall performance and to make new discoveries comparable to those made by human experts. The GeneQuiz system includes a Web-based browser that allows examination of the evidence leading to an automatic annotation and offers additional information, views of the results, and links to biological databases that complement the automatic analysis. System structure and operating principles concerning the use of multiple sequence databases, underlying sequence analysis tools, lexical analyses of database annotations and decision criteria for functional assignments are detailed. The system makes automatic quality assessments of results based on prior experience with the underlying sequence analysis tools; overall error rates in functional assignment are estimated at 2.5-5% for cases annotated with highest reliability ('clear' cases). Sources of overinterpretation of results are discussed with proposals for improvement. A conservative definition for reporting 'new findings' that takes account of database maturity is presented along with examples of possible kinds of discoveries (new function, family and superfamily) made by the system. System performance in relation to sequence database coverage, database dynamics and database search methods is analysed, demonstrating the inherent advantages of an integrated automatic approach using multiple databases and search methods applied in an objective and repeatable manner.