CLEANING THE GENBANK ARABIDOPSIS-THALIANA DATA SET

Citation
P.G. Korning et al., CLEANING THE GENBANK ARABIDOPSIS-THALIANA DATA SET, Nucleic Acids Research, 24(2), 1996, pp. 316-320
Number of citations
15
Subject Categories
Biology
Journal title
Nucleic Acids Research
Journal ISSN
0305-1048
Volume
24
Issue
2
Year of publication
1996
Pages
316-320
Database
ISI
SICI code
0305-1048(1996)24:2<316:CTGADS>2.0.ZU;2-U
Abstract
Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
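
The abstract does not list the individual sanity checks, so the following is only an illustrative sketch of what a gene structure check of this general kind might look like, not the authors' procedure. It assumes exons are given as 1-based inclusive (start, end) pairs on the forward strand, and it flags overlapping exons and introns that lack the canonical GT...AG boundary dinucleotides. The function name check_gene_structure and the input layout are hypothetical; rare non-canonical introns (for example GC...AG) would be reported here and would need manual review.

```python
def check_gene_structure(sequence, exons):
    """Return a list of problems found in an exon annotation.

    sequence: genomic DNA string (forward strand).
    exons: list of (start, end) pairs, 1-based and inclusive.
    Hypothetical helper for illustration; not taken from the paper.
    """
    problems = []
    seq = sequence.upper()
    ordered = sorted(exons)

    # Each exon must lie inside the sequence and have start <= end.
    for start, end in ordered:
        if start < 1 or end > len(seq) or start > end:
            problems.append(f"exon {start}..{end} is out of range or inverted")

    # Examine each pair of neighbouring exons and the intron between them.
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}..{e1} and {s2}..{e2} overlap")
            continue
        # Intron covers 1-based positions e1+1 .. s2-1.
        intron = seq[e1:s2 - 1]
        if len(intron) < 4:
            problems.append(f"intron {e1 + 1}..{s2 - 1} is too short to check")
        elif not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(
                f"intron {e1 + 1}..{s2 - 1} lacks canonical GT...AG boundaries"
            )
    return problems


if __name__ == "__main__":
    # Toy sequence: 6 nt exon, 12 nt intron (GT...AG), 6 nt exon.
    toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
    print(check_gene_structure(toy, [(1, 6), (19, 24)]))
    print(check_gene_structure(toy, [(1, 7), (19, 24)]))
```

An empty list means the annotation passed these particular checks; it says nothing about errors the checks do not cover, such as conflicting annotations of the same gene in different entries.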