As increasing amounts of genomic sequence from many organisms become
available, and as DNA sequences become a primary reagent in biologic
investigations, the role of annotation as a prospective guide for
laboratory experiments will expand rapidly. Here we describe a process of
high-throughput, reliable annotation, called framework annotation, which
is designed to provide a foundation for initial biologic
characterization of previously unexamined sequence. To examine this concept
in practice, we have constructed Genome Annotation and Information Analysis
(GAIA), a prototype software architecture that implements several
elements important for framework annotation. The center of GAIA consists
of an annotation database and the associated data management subsystem
that forms the software bus along which other components communicate.
The schema for this database defines three principal concepts: (1)
Entries, consisting of sequence and associated historical data; (2)
Features, comprising information of biologic interest; and (3) Experiments,
describing the evidence that supports Features. The database permits
tracking of annotation results over time, as well as assessment of the
reliability of particular results. New framework annotation is
produced by CARTA, a set of autonomous sensors that perform automatic
analyses and assert results into the annotation database. These results are
available via a Web-based query interface that uses graphical Java
applets as well as text-based HTML pages to display data at different
levels of resolution and permit interactive exploration of annotation. We
present results for initial application of framework annotation to a
set of test sequences, demonstrating its effectiveness in providing a
starting point for biologic investigation, and discuss ways in which
the current prototype can be improved. The prototype is available for
public use and comment at http://www.cbil.upenn.edu/gaia.
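
The three schema concepts and the sensor-to-database flow described above can be illustrated with a minimal Python sketch. This is not the GAIA schema or the CARTA implementation. Every class, field, and function name here (Entry, Feature, Experiment, AnnotationDB, start_codon_sensor) is a hypothetical stand-in, and the in-memory store only imitates the role of the annotation database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Entry:
    """A sequence plus its associated history (hypothetical fields)."""
    entry_id: str
    sequence: str
    history: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    """Evidence supporting a Feature: which analysis ran, when, and a score."""
    sensor: str
    run_at: datetime
    score: float


@dataclass
class Feature:
    """Information of biologic interest located on an Entry."""
    entry_id: str
    kind: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    evidence: List[Experiment] = field(default_factory=list)


class AnnotationDB:
    """In-memory stand-in for the annotation database (assumption)."""

    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}
        self.features: List[Feature] = []

    def add_entry(self, entry: Entry) -> None:
        self.entries[entry.entry_id] = entry

    def assert_feature(self, feature: Feature) -> None:
        # A Feature must refer to a known Entry.
        if feature.entry_id not in self.entries:
            raise KeyError(f"unknown entry: {feature.entry_id}")
        self.features.append(feature)

    def features_for(self, entry_id: str) -> List[Feature]:
        return [f for f in self.features if f.entry_id == entry_id]


def start_codon_sensor(db: AnnotationDB, entry_id: str) -> None:
    """Toy sensor: assert one Feature, with its evidence, per ATG found."""
    seq = db.entries[entry_id].sequence
    pos = seq.find("ATG")
    while pos != -1:
        evidence = Experiment("start_codon_sensor",
                              datetime.now(timezone.utc), 1.0)
        db.assert_feature(
            Feature(entry_id, "start_codon", pos, pos + 3, [evidence]))
        pos = seq.find("ATG", pos + 1)


db = AnnotationDB()
db.add_entry(Entry("demo1", "CCATGAAATGTT"))
start_codon_sensor(db, "demo1")
```

Keeping each Experiment attached to the Feature it supports, with a timestamp, is one simple way to allow results to be tracked over time and their reliability to be assessed later. A real system would presumably use persistent storage rather than Python lists.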