Motivation: Large-scale genome projects generate a rapidly increasing
number of sequences, most of them biochemically uncharacterized. Research
in bioinformatics contributes to the development of methods for the computational
characterization of these sequences. However, the installation and
application of these methods require experience and are time consuming.
Results: We present here an automatic system for preliminary functional
annotation of protein sequences that has been applied to the analysis of sets
of sequences from complete genomes, both to refine overall performance and
to make new discoveries comparable to those made by human experts. The
GeneQuiz system includes a Web-based browser that allows examination of the
evidence leading to an automatic annotation and offers additional information,
views of the results, and links to biological databases that complement
the automatic analysis. System structure and operating principles concerning
the use of multiple sequence databases, underlying sequence analysis tools,
lexical analyses of database annotations and decision criteria for
functional assignments are detailed. The system makes automatic quality
assessments of results based on prior experience with the underlying sequence
analysis tools; overall error rates in functional assignment are estimated at
2.5-5% for cases annotated with highest reliability ('clear' cases). Sources of
overinterpretation of results are discussed with proposals for improvement.
A conservative definition for reporting 'new findings' that takes account
of database maturity is presented along with examples of possible kinds of
discoveries (new function, family and superfamily) made by the system.
System performance in relation to sequence database coverage, database
dynamics and database search methods is analysed, demonstrating the inherent
advantages of an integrated automatic approach using multiple databases and
search methods applied in an objective and repeatable manner.