Detection of protein coding sequences using a mixture model for local protein amino acid sequence

Citation
E.C. Thayer et al., Detection of protein coding sequences using a mixture model for local protein amino acid sequence, J COMPUT BI, 7(1-2), 2000, pp. 317-327
Number of citations
30
Subject Categories
Biochemistry & Biophysics
Journal title
JOURNAL OF COMPUTATIONAL BIOLOGY
ISSN journal
1066-5277
Volume
7
Issue
1-2
Year of publication
2000
Pages
317 - 327
Database
ISI
SICI code
1066-5277(200002/04)7:1-2<317:DOPCSU>2.0.ZU;2-Z
Abstract
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short (less than or equal to 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.