INTRINSIC AND EXTRINSIC APPROACHES FOR DETECTING GENES IN A BACTERIAL GENOME

Citation
M. Borodovsky et al., INTRINSIC AND EXTRINSIC APPROACHES FOR DETECTING GENES IN A BACTERIAL GENOME, Nucleic acids research, 22(22), 1994, pp. 4756-4767
Citations number
64
Subject Categories
Biology
Journal title
Nucleic acids research
ISSN journal
03051048
Volume
22
Issue
22
Year of publication
1994
Pages
4756 - 4767
Database
ISI
SICI code
0305-1048(1994)22:22<4756:IAEAFD>2.0.ZU;2-O
Abstract
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by both GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.
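The percentages quoted in the abstract can be checked from the ORF counts it reports. The following is a minimal sketch in Python, not part of the original record: the variable names and the `percent` helper are illustrative assumptions, and only the four counts (354, 208, 182, 73) come from the abstract.

```python
# Counts taken from the abstract; all names below are illustrative.
genemark_orfs = 354   # putative expressed ORFs predicted by GeneMark
blast_orfs = 208      # ORFs with significant BLASTX/TBLASTN similarity
both = 182            # ORFs supported by both GeneMark and BLAST
conserved = 73        # ORFs in ancient conserved protein families


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Overlap expressed relative to each method, as in the abstract.
overlap_of_genemark = percent(both, genemark_orfs)
overlap_of_blast = percent(both, blast_orfs)
conserved_of_genemark = percent(conserved, genemark_orfs)

# Derived counts (not stated in the abstract, implied by the arithmetic).
genemark_only = genemark_orfs - both
blast_only = blast_orfs - both

print(overlap_of_genemark, overlap_of_blast, conserved_of_genemark)
print(genemark_only, blast_only)
```

The GeneMark-only and BLAST-only counts are derived by subtraction and are not figures given in the abstract itself.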