The identification of genes in newly determined vertebrate genomic
sequences can range from a trivial to an impossible task. In a
statistical preamble, we show how "insignificant" are the individual
features
on which gene identification can be rigorously based: promoter
signals, splice sites, open reading frames, etc. The practical
identification
of genes is thus ultimately dependent on their resemblance to those
already present in sequence databases, or incorporated into training
sets. The inherent conservatism of the currently popular methods
(database similarity search, GRAIL) will greatly limit our capacity
for making unexpected biological discoveries from increasingly
abundant genomic data. Beyond a very limited subset of trivial cases,
the automated
interpretation (i.e. without experimental validation) of genomic data
is still a myth. On the other hand, characterizing the 60 000 to 100
000 genes thought to be hidden in the human genome by means of
individual experiments is not feasible. Thus, it appears that our only
hope of turning genome data into genome information must rely on drastic
progress in the way we identify and analyse genes in silico. (C) 1997
Elsevier Science Ltd.