The EcoGene project involves the examination of Escherichia coli K-12
DNA sequences and accompanying annotation in the public databases in
order to refine the representation and prediction of the entire set of
E. coli K-12 chromosomally encoded protein sequences. The results of
this ongoing effort have been deposited in the SWISSPROT protein
sequence database as sequencing of the E. coli genome has progressed to
completion in recent years. Through this continuing research, we have
discovered that the prediction of low molecular weight (small) proteins,
arbitrarily defined as protein sequences less than or equal to 150
amino acids (aa) in length, is problematic and requires special attention.
We describe the small protein subset of EcoGene and the approach used
to derive this subset from the complete E. coli genome sequence and
database annotations. These E. coli proteins have helped to identify
new small genes in other organisms and to identify conserved residues
(motifs) using database searches and multiple alignments. Two thirds of
the E. coli small proteins have not been characterized experimentally.
The careful application of computer and laboratory methods to the
analysis of small proteins is needed for accurate prediction,
verification and characterization. The problem of accurate protein
sequence identification is not limited to small proteins or to E. coli;
it is encountered to varying degrees throughout all sequence
databases.
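
As an illustration only, the length criterion stated above (protein sequences less than or equal to 150 aa) can be sketched as a simple filter over FASTA-formatted protein records. This is not the EcoGene procedure, which also relies on database annotations and manual review. The function names, the toy record names and the handling of a trailing stop symbol are assumptions made for this example.

```python
from typing import Dict, Iterable

# Inclusive threshold taken from the text: small proteins are <= 150 aa.
SMALL_PROTEIN_MAX_AA = 150


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {name: sequence} mapping.

    The record name is the first whitespace-delimited token after '>'.
    Hypothetical helper written for this sketch, not part of EcoGene.
    """
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a positional name if the header is empty.
            name = fields[0] if fields else f"record_{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def small_proteins(records: Dict[str, str],
                   max_len: int = SMALL_PROTEIN_MAX_AA) -> Dict[str, str]:
    """Keep records whose length is <= max_len amino acids.

    Assumption: a trailing '*' marks a stop codon and is not counted.
    """
    return {
        name: seq
        for name, seq in records.items()
        if len(seq.rstrip("*")) <= max_len
    }


if __name__ == "__main__":
    # Toy sequences, not real E. coli proteins.
    toy = [">hypA toy record", "M" * 150, ">hypB", "M" * 151]
    print(sorted(small_proteins(parse_fasta(toy))))
```

A length cutoff of this kind only selects candidates; as the text notes, accurate prediction of small proteins still requires careful computational and laboratory verification.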