A common problem in speech processing is the conversion of the written form
of a language to a set of phonetic symbols representing the pronunciation.
In this paper, we focus on an aspect of this problem specific to the
Japanese language. Written Japanese consists of a mixture of three types of
symbols: kanji, hiragana, and katakana. We describe an algorithm for converting
conventional Japanese orthography to a hiragana-like symbol set that
closely approximates the most common pronunciation of the text. The algorithm
is based on two hypotheses: 1) the correct reading of a kanji character can
be determined by examining a small number of adjacent characters and 2) the
number of such combinations required in a dictionary is manageable.
The algorithm described here converts the input text by selecting the most
probable sequence of orthographic units (n-grams) that can be concatenated
to form the input text. In closed-set testing, the n-gram algorithm was
shown to provide better performance than several public domain algorithms,
achieving a sentence error rate of 3% on a wide range of text material. Though
the focus of this paper is written Japanese, the pattern matching
algorithm described here has applications to similar problems in other languages.
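
The selection of the most probable sequence of orthographic units can be
illustrated with a minimal sketch. It assumes a dictionary that maps each
unit (n-gram) to candidate readings with probabilities, and it uses a simple
dynamic-programming search over all ways of concatenating units to cover the
input. The names TOY_DICT and convert, the max_len parameter, and all
probability values are hypothetical and chosen for illustration only; the
paper's actual dictionary and scoring may differ.

```python
import math

# Hypothetical toy dictionary: orthographic unit -> [(reading, probability)].
# The probabilities are invented for illustration, not taken from the paper.
TOY_DICT = {
    "今日": [("きょう", 0.9)],
    "今": [("いま", 0.6)],
    "日": [("ひ", 0.5), ("にち", 0.3)],
    "日本": [("にほん", 0.9)],
    "本": [("ほん", 0.6)],
    "語": [("ご", 0.9)],
}


def convert(text, dictionary=TOY_DICT, max_len=4):
    """Return (reading, log_prob) for the most probable way of covering
    `text` with dictionary units, or None if no covering exists."""
    n = len(text)
    # best[i] holds (log_prob, reading) for the best covering of text[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        # Try every unit of up to max_len characters that ends at `end`.
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue  # text[:start] cannot be covered
            unit = text[start:end]
            for reading, prob in dictionary.get(unit, ()):
                score = best[start][0] + math.log(prob)
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + reading)
    if best[n] is None:
        return None
    return best[n][1], best[n][0]


if __name__ == "__main__":
    print(convert("今日"))
    print(convert("日本語"))
```

In this sketch the two-character unit 今日 outscores the concatenation of the
single-character units 今 and 日, so the context-dependent reading きょう is
selected. This mirrors the first hypothesis above: a small number of adjacent
characters is enough to fix the reading of a kanji character.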