Motivation: Compositionally homogeneous segments of genomic DNA often
correspond to meaningful biological units. Simple sliding window analysis
is usually insufficient for compositional segmentation of natural
sequences. Hidden Markov models (HMM) with a small number of states are a
natural language for description of compositional properties of
chromosome-size DNA sequences.
Results: The algorithms were applied to yeast Saccharomyces cerevisiae
chromosomes (YC) I, III, IV, VI and IX. The optimal number of HMM states is
found to be four. The optimal four-state HMMs for all chromosomes are very
similar, as are the reconstructed segmentations. In most cases, the model
with k + 1 states is obtained by 'splitting' one of the states of the model
with k states, with a corresponding increase in the level of detail of the
segmentation. The high-AT states usually correspond to intergenic regions.
We also explore the model's likelihood landscape and analyze the dynamics
of the optimization process, thus addressing the reliability of the
obtained optima and the efficiency of the algorithms.
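
As an illustration of what an HMM-based compositional segmentation looks like in practice, the following is a minimal sketch, not the algorithms used in this work. It assumes a hypothetical two-state model (an AT-rich and a GC-rich state) with made-up transition and emission probabilities, and recovers the most probable state path with standard log-space Viterbi decoding. All names and parameter values (STATES, START, TRANS, EMIT, viterbi, segments) are assumptions introduced for the example.

```python
import math

# Hypothetical two-state model; these values are illustrative assumptions,
# not parameters fitted to any yeast chromosome.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.95, "GC_rich": 0.05},
    "GC_rich": {"AT_rich": 0.05, "GC_rich": 0.95},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def viterbi(seq):
    """Return the most probable hidden-state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # Log-score of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev = score
        score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segments(path):
    """Collapse a state path into (state, start, end) runs, end exclusive."""
    runs = []
    start = 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            runs.append((path[start], start, i))
            start = i
    return runs
```

Calling `segments(viterbi(sequence))` on a DNA string over the alphabet A, C, G, T yields a list of homogeneous runs. Because leaving a state is penalized by the transition probabilities, an isolated G or C inside an AT-rich stretch does not open a new segment, which is the behavior a fixed-width sliding window cannot provide without choosing a window size in advance.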