CLEANING THE GENBANK ARABIDOPSIS-THALIANA DATA SET

Authors

KORNING PG; HEBSGAARD SM; ROUZE P; BRUNAK S

Citation

P.G. Korning et al., CLEANING THE GENBANK ARABIDOPSIS-THALIANA DATA SET, Nucleic acids research, 24(2), 1996, pp. 316-320

Number of citations

Subject categories

Biology

Journal title

Nucleic acids research

ISSN journal

0305-1048

Volume

24

Issue

2

Year of publication

1996

Pages

316 - 320

Database

ISI

SICI code

0305-1048(1996)24:2<316:CTGADS>2.0.ZU;2-U

Abstract

Data driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted did contain erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error corrected subset of the data for A. thaliana is made available through anonymous FTP.
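
Illustrative note

The abstract does not list the individual sanity checks that were applied. As a rough illustration of what a gene structure check of this kind might look like, the Python sketch below tests two properties of an annotated entry: that the exons are ordered and non-overlapping, and that each implied intron begins with GT and ends with AG, the common splice site consensus. The function name `check_gene_structure`, the coordinate convention (1-based, inclusive) and the example sequence are assumptions made for this illustration and are not taken from the paper. Because a minority of real introns do not follow the GT-AG rule, a reported problem is best read as a flag for manual review and not as proof of an error.

```python
from typing import List, Tuple

# Hypothetical convention: exons as 1-based inclusive (start, end) pairs.
Exon = Tuple[int, int]


def check_gene_structure(seq: str, exons: List[Exon]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    seq = seq.upper()
    problems = []

    # Every exon must lie inside the sequence with start <= end.
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} is out of range")
    if problems:
        return problems

    # Walk over consecutive exon pairs and inspect the intron between them.
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 <= e1:
            problems.append(
                f"exons {s1}..{e1} and {s2}..{e2} overlap or are out of order"
            )
            continue
        # Intron spans e1+1 .. s2-1 (1-based), i.e. seq[e1:s2-1] in Python.
        intron = seq[e1:s2 - 1]
        if not intron:
            problems.append(f"no intron between exons {s1}..{e1} and {s2}..{e2}")
            continue
        if intron[:2] != "GT":
            problems.append(f"intron after exon {s1}..{e1} does not start with GT")
        if intron[-2:] != "AG":
            problems.append(f"intron before exon {s2}..{e2} does not end with AG")
    return problems


# Made-up toy sequence: 6 nt exon, 12 nt intron (GT...AG), 6 nt exon.
toy_seq = "ATGGCC" + "GTAAGTTTTCAG" + "GGCTAA"
consistent = check_gene_structure(toy_seq, [(1, 6), (19, 24)])
shifted = check_gene_structure(toy_seq, [(1, 7), (19, 24)])
print(consistent)
print(shifted)
```

In the toy example, moving the end of the first exon by a single base makes the implied intron start with TA instead of GT, which is the kind of splice site misassignment the abstract describes as the most common source of error.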