Ec. Thayer et al., Detection of protein coding sequences using a mixture model for local protein amino acid sequence, J COMPUT BI, 7(1-2), 2000, pp. 317-327
Locating protein coding regions in genomic DNA is a critical step in
accessing the information generated by large-scale sequencing projects.
Current methods for gene detection depend on statistical measures of
content differences between coding and noncoding DNA, in addition to the
recognition of promoters, splice sites, and other regulatory sites. Here we
explore the potential value of recurrent amino acid sequence patterns 3-19
amino acids in length as a content statistic for use in gene finding
approaches. A finite mixture model incorporating these patterns can
partially discriminate protein sequences which have no (detectable) known
homologs from randomized versions of these sequences, and from short (less
than or equal to 50 amino acids) non-coding segments extracted from the
S. cerevisiae genome. The mixture-model-derived scores for a collection of
human exons were not correlated with the GENSCAN scores, suggesting that
the addition of our protein pattern recognition module to current gene
recognition programs may improve their performance.
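
The scoring idea in the abstract can be illustrated with a toy example. The
sketch below is not the authors' model: it assumes a fixed window width of 5
(the abstract describes patterns of 3-19 residues), two hand-made
hypothetical patterns, a uniform background, and arbitrary mixture weights,
none of which come from the paper. It scores each window by the log-odds of a
finite mixture (background plus pattern components) against the background
alone, then compares a sequence with randomized (shuffled) versions of
itself.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; a real model would use observed frequencies.
BACKGROUND = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def make_component(pattern, match_prob=0.81):
    """Build position-specific distributions that favour `pattern`."""
    other = (1.0 - match_prob) / (len(AMINO_ACIDS) - 1)
    return [{a: (match_prob if a == p else other) for a in AMINO_ACIDS}
            for p in pattern]


# Hypothetical patterns and weights, chosen only for illustration.
PATTERNS = [make_component("GKSTL"), make_component("DEADH")]
PATTERN_WEIGHTS = [0.25, 0.25]
BACKGROUND_WEIGHT = 0.5


def window_log_odds(window):
    """Return log of P_mixture(window) / P_background(window)."""
    ratio = BACKGROUND_WEIGHT  # the background component contributes w_bg * 1
    for weight, component in zip(PATTERN_WEIGHTS, PATTERNS):
        log_ratio = sum(math.log(column[a] / BACKGROUND[a])
                        for column, a in zip(component, window))
        ratio += weight * math.exp(log_ratio)
    return math.log(ratio)


def sequence_score(seq, width=5):
    """Average the window log-odds over all overlapping windows of `seq`."""
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return sum(window_log_odds(w) for w in windows) / len(windows)


def shuffled(seq, rng):
    """Return a randomized version of `seq` with the same composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "MAGKSTLVVDEADHLLGKSTLAA"  # made-up sequence containing both patterns
    null_scores = [sequence_score(shuffled(seq, rng)) for _ in range(100)]
    print("sequence score:", sequence_score(seq))
    print("mean shuffled score:", sum(null_scores) / len(null_scores))
```

Shuffling preserves amino acid composition, so any difference between the
two scores in this toy setting reflects the local order of residues rather
than their overall frequencies.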