Data-driven computational biology relies on the large quantities of
genomic data stored in international sequence data banks. However, the
possibilities are drastically impaired if the stored data is
unreliable. During a project aiming to predict splice sites in the
dicot Arabidopsis thaliana, we extracted a data set from the
A.thaliana entries in
GenBank. A number of simple 'sanity' checks, based on the nature of
the data, revealed an alarmingly high error rate. More than 15% of the
most important entries extracted contained erroneous information. In
addition, a number of entries had directly conflicting assignments of
exons and introns, not stemming from alternative splicing. In a few
cases the errors are due to mere typographical misprints, which may be
corrected by comparison to the original papers, but errors caused by
wrong assignments of splice sites from experimental data are the most
common. It is proposed that the level of error correction should be
increased and that gene structure sanity checks should be
incorporated, also at the submitter level, to avoid or reduce the
problem in the future.
A non-redundant and error-corrected subset of the data for A.thaliana
is made available through anonymous FTP.
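
The abstract does not list the individual sanity checks, so the
following is only an illustrative sketch of what a gene structure
check of this kind might look like, not the procedure used in the
study. It assumes a forward-strand genomic sequence and exons given as
1-based inclusive (start, end) pairs, and it tests three things: that
exon coordinates fall inside the sequence, that exons are ordered and
separated by an intron, and that each implied intron starts with GT
and ends with AG. The function name check_gene_structure and the
message wording are hypothetical. A small fraction of genuine introns
have non-canonical boundaries, so a GT/AG failure is best read as a
flag for manual review rather than proof of an error.

```python
def check_gene_structure(seq, exons):
    """Return a list of problems found in an exon annotation.

    seq   -- genomic DNA string, forward strand (assumption)
    exons -- list of (start, end) pairs, 1-based inclusive
    """
    problems = []
    seq = seq.upper()

    # Every exon must lie inside the sequence with start <= end.
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} is out of range")
    if problems:
        return problems  # later checks need valid coordinates

    # Consecutive exons must be ordered and separated by an intron.
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 <= e1 + 1:
            problems.append(
                f"exons {s1}..{e1} and {s2}..{e2} overlap or are adjacent"
            )
            continue
        # The intron covers 1-based positions e1+1 .. s2-1.
        intron = seq[e1:s2 - 1]
        if not intron.startswith("GT"):
            problems.append(f"intron {e1 + 1}..{s2 - 1} does not start with GT")
        if not intron.endswith("AG"):
            problems.append(f"intron {e1 + 1}..{s2 - 1} does not end with AG")

    return problems


if __name__ == "__main__":
    # Toy sequence: 6 nt exon, 8 nt intron (GT...AG), 6 nt exon.
    toy = "ATGAAA" + "GTCCCCAG" + "TTTTAA"
    print(check_gene_structure(toy, [(1, 6), (15, 20)]))
    print(check_gene_structure(toy, [(1, 5), (15, 20)]))
```

In the toy example the second call shifts the first exon boundary by
one base, so the implied intron no longer begins with GT and the
annotation is flagged.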