One of the first useful products from the human genome will be a set of
predicted genes. Besides its intrinsic scientific interest, the accuracy
and completeness of this data set is of considerable importance for human
health and medicine. Although progress has been made on computational gene
identification in terms of both methods and accuracy evaluation measures,
most of the sequence sets on which the programs are tested consist of short
genomic sequences, and there is concern that these accuracy measures may
not extrapolate well to larger, more challenging data sets. Given the
absence of experimentally verified large genomic data sets, we constructed
a semiartificial test set comprising a number of short single-gene genomic
sequences with randomly generated intergenic regions. This test set, which
should still present an easier problem than real human genomic sequence,
mimics the ~200-kb-long BACs being sequenced. In our experiments with these
longer genomic sequences, the accuracy of GENSCAN, one of the most accurate
ab initio gene prediction programs, dropped significantly, although its
sensitivity remained high. Conversely, the accuracy of similarity-based
programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected
significantly by the presence of random intergenic sequence, but depended
on the strength of the similarity to the protein homolog. As expected, the
accuracy dropped if the models were built using more distant homologs, and
we were able to quantitatively estimate this decline. However, the
specificities of these techniques are still rather good even when the
similarity is weak, which is a desirable characteristic for driving
expensive follow-up experiments. Our experiments suggest that although gene
prediction will improve with every new protein that is discovered and
through improvements in the current set of tools, we still have a long way
to go before we can decipher the precise exonic structure of every gene in
the human genome using purely computational methodology.
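
The construction of such a test set, and the way a prediction can be scored against it, can be illustrated with a minimal Python sketch. It is not the procedure actually used in this work: the function names, the uniform base composition of the random spacers, the fixed spacer length, and the toy sequences are all assumptions made for illustration. Sensitivity is taken here as the fraction of true coding nucleotides that are predicted, and specificity as the fraction of predicted coding nucleotides that are truly coding, a convention common in gene-prediction evaluation that the text above does not spell out.

```python
import random


def random_dna(length, rng):
    """Random DNA with uniform base frequencies (an assumption of this sketch)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_semiartificial(genes, spacer_len, seed=0):
    """Join single-gene sequences with random intergenic spacers.

    genes: list of (sequence, exons); exons are 0-based, half-open
    (start, end) intervals relative to that gene's own sequence.
    Returns the combined sequence and the exons in combined coordinates.
    """
    rng = random.Random(seed)
    parts = [random_dna(spacer_len, rng)]  # leading intergenic spacer
    offset = spacer_len
    shifted = []
    for seq, exons in genes:
        # Shift this gene's exons to their position in the combined sequence.
        shifted.extend((s + offset, e + offset) for s, e in exons)
        parts.append(seq)
        parts.append(random_dna(spacer_len, rng))  # spacer after each gene
        offset += len(seq) + spacer_len
    return "".join(parts), shifted


def coding_positions(exons):
    """Set of nucleotide positions covered by the given exon intervals."""
    return {p for s, e in exons for p in range(s, e)}


def nucleotide_accuracy(true_exons, predicted_exons):
    """Nucleotide-level (sensitivity, specificity) as defined in the lead-in."""
    truth = coding_positions(true_exons)
    pred = coding_positions(predicted_exons)
    tp = len(truth & pred)
    sensitivity = tp / len(truth) if truth else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy input: two hypothetical single-gene sequences with annotated exons.
toy_genes = [
    ("ATGGCCTAA", [(0, 9)]),
    ("ATGAAATTTTAG", [(0, 6), (9, 12)]),
]
sequence, true_exons = build_semiartificial(toy_genes, spacer_len=20, seed=1)

# A hypothetical prediction that misses one short exon and calls a
# false exon inside the final random spacer.
predicted = [(20, 29), (49, 55), (70, 76)]
sn, sp = nucleotide_accuracy(true_exons, predicted)
```

In this toy case the false exon lies entirely in random intergenic sequence, so it lowers specificity without affecting sensitivity. A realistic version would use spacers long enough to bring each sequence to BAC scale and might match the base composition of real intergenic DNA.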