This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no," or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops.
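To make the idea concrete, here is a minimal sketch of training one bigram model per move type from move-tagged utterances. The toy data, the add-one smoothing, and all identifiers are illustrative assumptions, not the paper's actual training setup.

```python
import math
from collections import defaultdict

# Hypothetical move-tagged training data: (move type, word sequence).
TAGGED_UTTERANCES = [
    ("instruct",     "go round the left of the lake".split()),
    ("acknowledge",  "okay right".split()),
    ("query-yes/no", "do you have a lake".split()),
]

def train_move_bigrams(tagged_utterances):
    """Build one bigram count table per move type, with <s>/</s> padding."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for move, words in tagged_utterances:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[move][prev][cur] += 1
    return counts

def log_prob(counts, move, words, vocab_size=1000):
    """Add-one-smoothed bigram log-probability of an utterance under the
    language model for one move type (a sketch, not the paper's smoothing)."""
    table = counts[move]
    padded = ["<s>"] + words + ["</s>"]
    lp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total = sum(table[prev].values())
        lp += math.log((table[prev][cur] + 1) / (total + vocab_size))
    return lp

models = train_move_bigrams(TAGGED_UTTERANCES)
print(log_prob(models, "acknowledge", ["okay"]))
```

The same utterance typically scores very differently under different move-specific models, which is what lets the correct model lower the word error rate.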
Of course, when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move-specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself (for example, a positive versus a negative response), there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
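As a rough illustration of how the three knowledge sources might be combined for one utterance, the sketch below picks the move type that maximizes the sum of the intonation score, the dialog-model score for the move sequence, and the move-specific recognizer likelihood. The plain unweighted log sum, the function names, and the dummy scores are our assumptions, not the paper's exact architecture.

```python
MOVE_TYPES = ["instruct", "acknowledge", "query-yes/no"]  # 3 of the 12

def best_move(prev_move, intonation_lp, dialog_lp, recognizer_lp):
    """Choose the move type m maximizing
        log P(intonation | m) + log P(m | prev_move) + log P(words | LM_m),
    where the terms come from the intonation model, the dialog model over
    move sequences, and the recognizer run with move-specific LMs."""
    return max(
        MOVE_TYPES,
        key=lambda m: intonation_lp[m]
        + dialog_lp[(prev_move, m)]
        + recognizer_lp[m],
    )

# Hypothetical log scores for one utterance (all values are made up).
intonation_lp = {"instruct": -2.0, "acknowledge": -1.2, "query-yes/no": -3.1}
dialog_lp = {
    ("instruct", "instruct"): -2.5,
    ("instruct", "acknowledge"): -0.7,
    ("instruct", "query-yes/no"): -1.9,
}
recognizer_lp = {"instruct": -40.0, "acknowledge": -12.5, "query-yes/no": -35.0}

print(best_move("instruct", intonation_lp, dialog_lp, recognizer_lp))
# -> "acknowledge": after an instruction, a short acknowledgment scores best.
```

Scoring in log space keeps the combination numerically stable and makes it easy to add per-source weights later if one knowledge source proves more reliable than another.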