Database verification studies of SWISS-PROT and GenBank

Citation
Pd. Karp et al., Database verification studies of SWISS-PROT and GenBank, BIOINFORMAT, 17(6), 2001, pp. 526-532
Number of citations
9
Subject categories
Multidisciplinary
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
17
Issue
6
Year of publication
2001
Pages
526 - 532
Database
ISI
SICI code
1367-4803(200106)17:6<526:DVSOSA>2.0.ZU;2-I
Abstract
Problem statement: We have studied the relationships among SWISS-PROT, TrEMBL, and GenBank with two goals. First is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information was present in reasonable quantities, it would allow researchers to decrease the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms. Second is to assess the consistency between translated GenBank sequences and sequences in SWISS-PROT and TrEMBL.
Results: (1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins. (2) SWISS-PROT is more incomplete than we expected in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier. (3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset. (4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using BLAST search; the remaining 2.0% of E. coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL. Although many of these differences can be explained by the complexity of the DB, and by the curation processes used to create it, the scale of the differences is notable.
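
The tiered comparison in result (4) can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `classify_match` and `summarize` are hypothetical, the substring test is assumed to apply in either direction, and the BLAST tier is left out, so anything that is neither identical nor a substring is simply reported as unmatched.

```python
from collections import Counter


def classify_match(query: str, references: set[str]) -> str:
    """Classify one translated protein against a set of reference sequences.

    Hypothetical helper: returns 'identical', 'substring', or 'unmatched'.
    """
    if query in references:
        return "identical"
    # Assumption: a substring match in either direction counts, which would
    # cover cases such as a removed initial methionine or a different
    # translation start position.
    for ref in references:
        if query in ref or ref in query:
            return "substring"
    # In a fuller pipeline these would go on to a similarity search (e.g. BLAST).
    return "unmatched"


def summarize(queries: list[str], references: set[str]) -> dict[str, float]:
    """Return the percentage of queries in each match category."""
    categories = ("identical", "substring", "unmatched")
    total = len(queries)
    if total == 0:
        return {name: 0.0 for name in categories}
    counts = Counter(classify_match(q, references) for q in queries)
    return {name: round(100.0 * counts[name] / total, 1) for name in categories}


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    references = {"MKTAYIAK", "MSLLE"}
    queries = ["MKTAYIAK", "KTAYIAK", "MSLLEQ", "GGGG"]
    print(summarize(queries, references))
```

The linear scan over references is fine for a toy example; at the scale of a full genome against SWISS-PROT/TrEMBL, an index or a dedicated search tool would be needed.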