We have performed a systematic analysis of gene identification in geno
mic sequence by similarity search against expressed sequence tags (EST
s) to assess the suitability of this method for automated annotation o
f the human genome. A BLAST-based strategy was constructed to examine
the potential of this approach, and was applied to test sets containin
g all human genomic sequences longer than 5 kb in public databases, pl
us 300 kb of exhaustively characterized benchmark sequence. At high st
ringency, 70%-90% of all annotated genes are detected by near-identity
to EST sequence; >95% of ESTs aligning with well-annotated sequences
overlap a gene. These ESTs provide immediate access to the correspondi
ng cDNA clones for follow-up laboratory verification and subsequent b
iologic analysis. At lower stringency, up to 97% of annotated genes we
re identified by similarity to ESTs. The apparent false-positive rate
rose to 55% among all sequences and 20% among benchmark sequen
ces at the lowest stringency, indicating that many genes in public dat
abase entries are unannotated. Approximately half of the alignments sp
an multiple exons, and thus aid in the construction of gene prediction
s and elucidation of alternative splicing. In addition, ESTs from mult
iple cDNA libraries frequently cluster over genes, providing a startin
g point for crude expression profiles. Clone IDs may be used to form E
ST pairs, and particularly to extend models by associating alignments
of lower stringency with high-quality alignments. These results demons
trate that EST similarity search is a practical general-purpose annota
tion technique that complements pattern recognition methods as a tool
for gene characterization.
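
The two headline measures above (the fraction of annotated genes detected, and the apparent false-positive rate among alignments) can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `Interval` and `Alignment` types, the `evaluate` function, the use of a percent-identity cutoff as the stringency setting, and the toy coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A region on a genomic sequence (1-based, inclusive coordinates)."""
    seq_id: str
    start: int
    end: int


@dataclass(frozen=True)
class Alignment:
    """A hypothetical EST-to-genomic alignment record."""
    region: Interval
    est_id: str
    percent_identity: float


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals lie on the same sequence and share a base."""
    return a.seq_id == b.seq_id and a.start <= b.end and b.start <= a.end


def evaluate(alignments, genes, min_identity):
    """Return (fraction of genes detected, apparent false-positive rate).

    Stringency is modeled here as a simple percent-identity cutoff,
    which is an assumption of this sketch.
    """
    kept = [a for a in alignments if a.percent_identity >= min_identity]
    # A gene counts as detected if any retained alignment overlaps it.
    detected = sum(
        1 for g in genes if any(overlaps(a.region, g) for a in kept)
    )
    # An alignment is an apparent false positive if it overlaps no
    # annotated gene; it may still mark an unannotated gene.
    off_gene = sum(
        1 for a in kept if not any(overlaps(a.region, g) for g in genes)
    )
    sensitivity = detected / len(genes) if genes else 0.0
    apparent_fp = off_gene / len(kept) if kept else 0.0
    return sensitivity, apparent_fp


# Toy data, invented for illustration only.
genes = [
    Interval("seqA", 100, 500),
    Interval("seqA", 2000, 2600),
    Interval("seqB", 50, 400),
]
alignments = [
    Alignment(Interval("seqA", 120, 300), "est1", 99.0),
    Alignment(Interval("seqA", 2100, 2300), "est2", 85.0),
    Alignment(Interval("seqA", 900, 1100), "est3", 80.0),
    Alignment(Interval("seqB", 60, 200), "est4", 97.5),
]

high = evaluate(alignments, genes, min_identity=95.0)
low = evaluate(alignments, genes, min_identity=75.0)
print(high, low)
```

In this toy data, lowering the cutoff detects the remaining gene but also admits an alignment outside any annotated gene, mirroring the trade-off between sensitivity and apparent false positives described above.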