The unannotated regions of the Escherichia coli genome DNA sequence
from the EcoSeq6 database, totaling 1,278 'intergenic' sequences with
a combined length of 359,279 base pairs, were analyzed using
computer-assisted methods with the aim of identifying putative
unknown genes. The
proposed strategy for finding new genes includes two key elements: i)
prediction of expressed open reading frames (ORFs) using the GeneMark
method based on Markov chain models for coding and non-coding regions
of Escherichia coli DNA, and ii) search for protein sequence
similarities using programs based on the BLAST algorithm and programs for motif
identification. A total of 354 putative expressed ORFs were predicted
by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that
208 ORFs located in the unannotated regions of the E. coli chromosome
are significantly similar to other protein sequences. Identification
of 182 ORFs as probable genes was supported by both GeneMark and BLAST;
these ORFs comprise 51.4% of the GeneMark 'hits' and 87.5% of the BLAST
'hits'. Seventy-three putative new genes, comprising 20.6% of the
GeneMark predictions,
belong to ancient conserved protein families that include both
eubacterial and eukaryotic members. This value is close to the overall
proportion of highly conserved sequences among eubacterial proteins,
indicating that the majority of the putative expressed ORFs that are
predicted by GeneMark, but have no significant BLAST hits, nevertheless
are likely to be real genes. The majority of the putative genes identified
by BLAST search have been described since the release of the EcoSeq6
database, but about 70 genes have not been detected so far. Among these
new identifications are genes encoding proteins with a variety of
predicted functions including dehydrogenases, kinases, several other
metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins,
and different types of regulatory proteins.
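
The percentages quoted above follow directly from the reported counts, and a short Python sketch can make the arithmetic explicit. The constants are the figures given in the text. The function and key names are hypothetical, chosen only for illustration. The derived values that the text does not state (ORFs supported by only one method, ORFs supported by either method, mean region length) rest on the assumption that the 182 ORFs are exactly the overlap between the 354 GeneMark predictions and the 208 BLAST hits.

```python
# Counts as reported in the text above.
GENEMARK_ORFS = 354        # putative expressed ORFs predicted by GeneMark
BLAST_ORFS = 208           # ORFs with significant protein similarity
BOTH = 182                 # ORFs supported by both GeneMark and BLAST
ANCIENT_CONSERVED = 73     # putative genes in ancient conserved families
REGIONS = 1278             # unannotated 'intergenic' sequences
TOTAL_BP = 359_279         # combined length of those sequences


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summarize():
    """Recompute the quoted percentages and a few derived counts.

    The derived counts assume BOTH is the exact intersection of the
    GeneMark and BLAST sets; the text does not report them directly.
    """
    return {
        "both_pct_of_genemark": percent(BOTH, GENEMARK_ORFS),
        "both_pct_of_blast": percent(BOTH, BLAST_ORFS),
        "conserved_pct_of_genemark": percent(ANCIENT_CONSERVED, GENEMARK_ORFS),
        "genemark_only": GENEMARK_ORFS - BOTH,
        "blast_only": BLAST_ORFS - BOTH,
        "either_method": GENEMARK_ORFS + BLAST_ORFS - BOTH,
        "mean_region_bp": round(TOTAL_BP / REGIONS, 1),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under that assumption, the GeneMark-only count is the set of ORFs to which the argument about ancient conserved families applies: predictions without a significant BLAST hit that are nevertheless considered likely to be real genes.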