Improving continuous speech recognition in Spanish by phone-class semicontinuous HMMs with pausing and multiple pronunciations

Citation
J. Ferreiros and J.M. Pardo, Improving continuous speech recognition in Spanish by phone-class semicontinuous HMMs with pausing and multiple pronunciations, SPEECH COMM, 29(1), 1999, pp. 65-76
Number of citations
18
Subject Categories
Computer Science & Engineering
Journal title
SPEECH COMMUNICATION
Journal ISSN
0167-6393
Volume
29
Issue
1
Year of publication
1999
Pages
65 - 76
Database
ISI
SICI code
0167-6393(199909)29:1<65:ICSRIS>2.0.ZU;2-Y
Abstract
This paper presents a comprehensive study of continuous speech recognition in Spanish. It shows the use and optimisation of several well-known techniques together with the application for the first time to Spanish of language specific knowledge to these systems, i.e. the careful selection of the phone inventory, the phone-classes used, and the selection of alternative pronunciation rules. We have developed a semicontinuous phone-class dependent contextual modelling. Using four phone-classes, we have obtained recognition error rate reductions roughly equivalent to the percentage increase of the number of parameters, compared to baseline semicontinuous contextual modelling. We also show that the use of pausing in the training system and multiple pronunciations in the vocabulary help to improve recognition rates significantly. The actual pausing of the training sentences and the application of assimilation effects improve the transcription into context-dependent units. Multiple pronunciation possibilities are generated using general rules that are easily applied to any Spanish vocabulary. With all these ideas we have reduced the recognition errors of the baseline system by more than 30% in a task parallel to DARPA-RM translated into Spanish with a vocabulary of 979 words. Our database contains four speakers with 600 training sentences and 100 testing sentences each. All experiments have been carried out with a perplexity of 979, and even slightly higher in the case of multiple pronunciations, to be able to study the acoustic modelling power of the systems with no grammar constraints. (C) 1999 Elsevier Science B.V. All rights reserved.
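Note on the perplexity figure: the abstract reports a perplexity equal to the vocabulary size (979) for recognition with no grammar constraints. One way to read this, assuming "no grammar" means every vocabulary entry is equally likely at each word position, is that the perplexity of a uniform distribution over V entries is exactly V. The short Python sketch below illustrates only that arithmetic; the function name `perplexity` and the uniform-distribution assumption are illustrative and are not taken from the paper.

```python
import math


def perplexity(probs):
    """Return the perplexity 2**H of a discrete distribution.

    `probs` is a list of probabilities summing to 1 (hypothetical helper,
    not part of the system described in the abstract).
    """
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits


# Vocabulary size stated in the abstract.
VOCAB_SIZE = 979

# Assumption: with no grammar, each word is equally likely at every position.
uniform = [1.0 / VOCAB_SIZE] * VOCAB_SIZE
pp_uniform = perplexity(uniform)

print(round(pp_uniform, 6))
```

Under the same assumption, treating each alternative pronunciation as an additional equally likely entry enlarges the set being chosen from, which would be consistent with the "slightly higher" perplexity the abstract mentions for multiple pronunciations.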