F. Jelinek, TRAINING AND SEARCH METHODS FOR SPEECH RECOGNITION, Proceedings of the National Academy of Sciences of the United States of America, 92(22), 1995, pp. 9964-9969
Speech recognition involves three processes: extraction of acoustic
indices from the speech signal, estimation of the probability that the
observed index string was caused by a hypothesized utterance segment,
and determination of the recognized utterance via a search among
hypothesized alternatives. This paper is not concerned with the first
process. Estimation of the probability of an index string involves a
model of index production by any given utterance segment (e.g., a
word). Hidden Markov models (HMMs) are used for this purpose [Makhoul,
J. & Schwartz, R. (1995) Proc. Natl. Acad. Sci. USA 92, 9956-9963].
Their parameters are state transition probabilities and output
probability distributions associated with the transitions. The Baum
algorithm that obtains the values of these parameters from speech data
via their successive
reestimation will be described in this paper. The recognizer wishes to
find the most probable utterance that could have caused the observed
acoustic index string. That probability is the product of two factors:
the probability that the utterance will produce the string and the
probability that the speaker will wish to produce the utterance (the
language model probability). Even if the vocabulary size is moderate, it
is impossible to search for the utterance exhaustively. One practical
algorithm is described [Viterbi, A. J. (1967) IEEE Trans. Inf. Theory
IT-13, 260-267] that, given the index string, has a high likelihood of
finding the most probable utterance.
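
The search idea named in the abstract can be illustrated with a minimal Viterbi sketch. It is not the paper's recognizer: it assumes a toy two-state HMM whose outputs are attached to states rather than to transitions (a simplification of the model described above), and every name and number in it (`viterbi`, `to_log`, states "A" and "B", the probability tables) is hypothetical. It returns the single most probable state path for an observed index string, working in log probabilities to avoid underflow.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return (best_log_prob, best_path) for the observation sequence obs."""
    # score[s]: best log probability of any state path that ends in s
    score = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor of s at this time step
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][o]
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return score[last], path


# Hypothetical toy model: two states, binary acoustic indices 0 and 1.
states = ["A", "B"]
log_init = to_log({"A": 0.6, "B": 0.4})
log_trans = to_log({"A": {"A": 0.7, "B": 0.3},
                    "B": {"A": 0.4, "B": 0.6}})
log_emit = to_log({"A": {0: 0.9, 1: 0.1},
                   "B": {0: 0.2, 1: 0.8}})

best_log_prob, best_path = viterbi([0, 1, 1], states,
                                   log_init, log_trans, log_emit)
print(best_path, math.exp(best_log_prob))
```

In a recognizer of the kind the abstract describes, the score being maximized would also include the language model probability of the hypothesized utterance; this sketch omits that factor.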