We describe a hidden Markov model, HMMSTR, for general protein sequence
based on the I-sites library of sequence-structure motifs. Unlike the linear
hidden Markov models used to model individual protein families, HMMSTR has a
highly branched topology and captures recurrent local features of protein
sequences and structures that transcend protein family boundaries. The model
extends the I-sites library by describing the adjacencies of different
sequence-structure motifs as observed in the protein database and, by
representing overlapping motifs in a much more compact form, achieves a great
reduction in parameters. The HMM attributes a considerably higher probability
to coding sequence than does an equivalent dipeptide model, predicts secondary
structure with an accuracy of 74.3 %, backbone torsion angles better than any
previously reported method and the structural context of beta strands
and turns with an accuracy that should be useful for tertiary structure
prediction. (C) 2000 Academic Press.