Problem statement: We have studied the relationships among SWISS-PROT, TrEMBL, and GenBank with two goals. The first is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information were present in reasonable quantities, it would allow researchers to decrease the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms. The second is to assess the consistency between translated GenBank sequences and sequences in SWISS-PROT and TrEMBL.
Results: (1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins. (2) SWISS-PROT is more incomplete than we expected, in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier. (3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset. (4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, our results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using a BLAST search; the remaining 2.0% of E. coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL. Although many of these differences can be explained by the complexity of the database, and by the curation processes used to create it, the scale of the differences is notable.
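
The tiered comparison in result (4) can be illustrated with a minimal sketch. It is not the procedure used in the study: the function names `classify_match` and `summarize` are hypothetical, sequences are assumed to be plain amino-acid strings, and a "substring" match is assumed to count in either direction (the translated GenBank protein inside the reference sequence, or the reverse). Sequences that fail both tests are only labeled "unresolved" here; in the study such cases were passed to a BLAST search, which this sketch does not perform.

```python
from collections import Counter


def classify_match(query, reference_seqs):
    """Classify one translated protein against a set of reference sequences.

    Returns 'identical', 'substring', or 'unresolved'.
    Assumption: a substring match in either direction counts.
    """
    if not query:
        return "unresolved"
    if query in reference_seqs:  # exact, full-length match
        return "identical"
    for ref in reference_seqs:
        # e.g. a missing initial methionine or a shifted start position
        if query in ref or ref in query:
            return "substring"
    return "unresolved"  # candidate for a similarity search such as BLAST


def summarize(queries, reference_seqs):
    """Return the percentage of queries falling into each category."""
    refs = {s for s in reference_seqs if s}  # drop empty strings
    counts = Counter(classify_match(q, refs) for q in queries)
    total = len(queries)
    categories = ("identical", "substring", "unresolved")
    if total == 0:
        return {c: 0.0 for c in categories}
    return {c: round(100.0 * counts[c] / total, 1) for c in categories}


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    references = ["MKTAYIAK", "MSSLLQ", "AAGGTT"]
    translated = ["MKTAYIAK", "KTAYIAK", "MSSLLQWW", "PPPP"]
    print(summarize(translated, references))
```

The linear scan over references is quadratic in the worst case; for genome-scale data an index (for example a suffix structure or k-mer lookup) would be a more practical choice.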