Kanji-to-hiragana conversion based on a length-constrained N-gram analysis

Citation
J. Picone et al., Kanji-to-hiragana conversion based on a length-constrained N-gram analysis, IEEE SPEECH, 7(6), 1999, pp. 685-696
Number of citations
25
Subject Categories
Electrical & Electronics Engineering
Journal title
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING
ISSN journal
1063-6676
Volume
7
Issue
6
Year of publication
1999
Pages
685 - 696
Database
ISI
SICI code
1063-6676(199911)7:6<685:KCBOAL>2.0.ZU;2-F
Abstract
A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: kanji, hiragana, and katakana. We describe an algorithm for converting conventional Japanese orthography to a hiragana-like symbol set that closely approximates the most common pronunciation of the text. The algorithm is based on two hypotheses: 1) the correct reading of a kanji character can be determined by examining a small number of adjacent characters and 2) the number of such combinations required in a dictionary is manageable. The algorithm described here converts the input text by selecting the most probable sequence of orthographic units (n-grams) that can be concatenated to form the input text. In closed-set testing, the n-gram algorithm was shown to provide better performance than several public domain algorithms, achieving a sentence error rate of 3% on a wide range of text material. Though the focus of this paper is written Japanese, the pattern matching algorithm described here has applications to similar problems in other languages.
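
Illustrative sketch
The selection step described in the abstract (choosing the most probable sequence of dictionary n-grams that concatenate to the input) can be pictured as a dynamic-programming search. The Python sketch below is not the authors' implementation. It assumes a hypothetical toy dictionary (TOY_DICT) with made-up probabilities, treats units as independent so that a sequence is scored by the product of its unit probabilities, and uses a hypothetical max_len parameter to stand in for the length constraint.

```python
import math

# Hypothetical toy dictionary: surface n-gram -> (hiragana reading, probability).
# The entries and probabilities are invented for illustration only.
TOY_DICT = {
    "日本語": ("にほんご", 0.7),
    "日本": ("にほん", 0.6),
    "日": ("ひ", 0.3),
    "本": ("ほん", 0.4),
    "語": ("ご", 0.5),
    "の": ("の", 0.9),
}


def convert(text, dictionary, max_len=3):
    """Return the reading of the most probable segmentation, or None.

    Assumption: units are scored independently, so the best sequence
    maximizes the sum of log-probabilities of its units.
    """
    n = len(text)
    # best[i] = (log-score, start of last unit, reading of last unit)
    # for the best segmentation of text[:i].
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)

    for end in range(1, n + 1):
        # Only consider units of at most max_len characters.
        for start in range(max(0, end - max_len), end):
            unit = text[start:end]
            if unit not in dictionary or best[start][0] == -math.inf:
                continue
            reading, prob = dictionary[unit]
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start, reading)

    if best[n][0] == -math.inf:
        return None  # the input cannot be covered by dictionary units

    # Walk the back-pointers from the end to recover the readings.
    readings = []
    pos = n
    while pos > 0:
        _, start, reading = best[pos]
        readings.append(reading)
        pos = start
    return "".join(reversed(readings))
```

With this toy dictionary, convert("日本語の本", TOY_DICT) prefers the longer unit 日本語 over 日本 + 語 because its single probability exceeds the product of the shorter units. Lowering max_len forces the search onto shorter units and can change the reading that is chosen.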