Robust speech detection method for telephone speech recognition system

Citation
S. Kuroiwa et al., Robust speech detection method for telephone speech recognition system, SPEECH COMM, 27(2), 1999, pp. 135-148
Number of citations
27
Subject Categories
Computer Science & Engineering
Journal title
SPEECH COMMUNICATION
Journal ISSN
0167-6393
Volume
27
Issue
2
Year of publication
1999
Pages
135 - 148
Database
ISI
SICI code
0167-6393(199903)27:2<135:RSDMFT>2.0.ZU;2-X
Abstract
This paper describes speech endpoint detection methods for continuous speech recognition systems used over telephone networks. Speech input to these systems may be contaminated not only by various ambient noises but also by various irrelevant sounds generated by users, such as coughs, tongue clicking, lip noises and certain out-of-task utterances. Under these adverse conditions, robust speech endpoint detection remains an unsolved problem. In fact, we found that speech endpoint detection errors occurred in over 10% of the inputs in field trials of a voice-activated telephone extension system. These errors were caused by problems of (1) low SNR, (2) long pauses between phrases and (3) irrelevant sounds prior to task sentences. To solve the first two problems, we propose a real-time speech ending point detection algorithm based on the implicit approach, which finds a sentence end by comparing the likelihood of a complete sentence hypothesis with that of other hypotheses. For the third problem, we propose a speech beginning point detection algorithm which rejects irrelevant sounds by using likelihood ratio and duration conditions. The effectiveness of these methods was evaluated under various conditions. As a result, we found that the ending point detection algorithm was not affected by long pauses and that the beginning point detection algorithm successfully rejected irrelevant sounds by using phone HMMs that fit the task. Furthermore, a garbage model of irrelevant sounds was also evaluated, and we found that the garbage modeling technique and the proposed method compensated for each other's weak points and that the best recognition accuracy was achieved by integrating these methods. (C) 1999 Elsevier Science B.V. All rights reserved.
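
Illustrative sketch (not taken from the paper): the abstract names the two decision ideas but gives no thresholds, features, or exact rules, so the Python below is only a minimal reading of them under stated assumptions. It assumes per-frame log-likelihood ratios (speech model versus background model) and per-frame scores for the complete-sentence hypothesis and the best competing hypothesis are already available from a decoder. The function names, the parameters `threshold`, `min_frames` and `hold_frames`, and the consecutive-frame rules are all hypothetical.

```python
from typing import Optional, Sequence


def detect_beginning(llr: Sequence[float], threshold: float,
                     min_frames: int) -> Optional[int]:
    """Return the first frame of the first run in which the log-likelihood
    ratio stays above `threshold` for at least `min_frames` consecutive
    frames, or None if no run qualifies.

    Assumption: shorter runs stand in for irrelevant sounds (clicks, coughs)
    and are rejected by the duration condition.
    """
    run_start = None
    for i, value in enumerate(llr):
        if value > threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_frames:
                return run_start
        else:
            run_start = None  # run ended before meeting the duration condition
    return None


def detect_ending(complete: Sequence[float], other: Sequence[float],
                  hold_frames: int) -> Optional[int]:
    """Return the frame at which the complete-sentence hypothesis has scored
    at least as high as the best competing hypothesis for `hold_frames`
    consecutive frames, or None if that never happens.

    Assumption: the hold time is a hypothetical stabilising rule, not a
    detail given in the abstract.
    """
    held = 0
    for i, (c, o) in enumerate(zip(complete, other)):
        held = held + 1 if c >= o else 0
        if held >= hold_frames:
            return i
    return None


if __name__ == "__main__":
    # A one-frame burst at index 1 is rejected; the sustained run begins at 4.
    ratios = [0.1, 2.0, 0.2, 0.1, 1.5, 1.8, 2.2, 1.9, 0.3]
    print(detect_beginning(ratios, threshold=1.0, min_frames=3))

    complete_scores = [-10.0, -9.0, -5.0, -4.0, -3.0, -2.0]
    other_scores = [-5.0, -5.0, -6.0, -6.0, -1.0, -3.0]
    print(detect_ending(complete_scores, other_scores, hold_frames=2))
```

In this sketch a pause inside a sentence does not by itself trigger an ending decision, because the decision depends on which hypothesis scores higher rather than on silence duration. That is one plausible reading of the "implicit approach" described in the abstract, not a confirmed detail of the authors' algorithm.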