Kanji-to-hiragana conversion based on a length-constrained N-gram analysis

Authors

Picone, J; Staples, T; Kondo, K; Arai, N

Citation

J. Picone et al., Kanji-to-hiragana conversion based on a length-constrained N-gram analysis, IEEE SPEECH, 7(6), 1999, pp. 685-696

Citations number

Subject Categories

Electrical & Electronics Engineering

Journal title

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING

ISSN journal

1063-6676

Volume

7

Issue

6

Year of publication

1999

Pages

685 - 696

Database

ISI

SICI code

1063-6676(199911)7:6<685:KCBOAL>2.0.ZU;2-F

Abstract

A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: kanji, hiragana, and katakana. We describe an algorithm for converting conventional Japanese orthography to a hiragana-like symbol set that closely approximates the most common pronunciation of the text. The algorithm is based on two hypotheses: 1) the correct reading of a kanji character can be determined by examining a small number of adjacent characters and 2) the number of such combinations required in a dictionary is manageable. The algorithm described here converts the input text by selecting the most probable sequence of orthographic units (n-grams) that can be concatenated to form the input text. In closed-set testing, the n-gram algorithm was shown to provide better performance than several public domain algorithms, achieving a sentence error rate of 3% on a wide range of text material. Though the focus of this paper is written Japanese, the pattern matching algorithm described here has applications to similar problems in other languages.
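
The selection step described in the abstract, choosing the most probable sequence of n-grams that concatenate to the input, can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the toy dictionary `NGRAMS`, its probabilities, the function name `convert`, the `max_len` length limit, and the scoring rule (product of independent per-n-gram probabilities) are all assumptions made for illustration.

```python
import math

# Hypothetical toy dictionary: n-gram -> (hiragana reading, probability).
# The entries and probabilities are invented for illustration only.
NGRAMS = {
    "日本語": ("にほんご", 0.95),
    "日本": ("にほん", 0.9),
    "日": ("ひ", 0.4),
    "本": ("ほん", 0.6),
    "語": ("ご", 0.8),
    "の": ("の", 1.0),
}


def convert(text, ngrams=NGRAMS, max_len=3):
    """Return the reading of the most probable n-gram segmentation of text.

    Assumes the score of a segmentation is the product of its n-gram
    probabilities, and that no n-gram is longer than max_len characters.
    Returns None if the text cannot be covered by dictionary entries.
    """
    n = len(text)
    # best[i] holds (log-probability, reading) for the best cover of text[:i].
    best = [(-math.inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            entry = ngrams.get(text[start:end])
            if entry is None or best[start][0] == -math.inf:
                continue  # unknown n-gram or unreachable prefix
            reading, prob = entry
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + reading)
    return None if best[n][0] == -math.inf else best[n][1]


if __name__ == "__main__":
    print(convert("日本語の本"))
    print(convert("日本語の本", max_len=1))
```

With this toy dictionary the longer entry "日本語" wins over its single-character pieces, which mirrors the first hypothesis: a few adjacent characters are enough to pick the reading. Restricting `max_len` to 1 forces character-by-character readings and shows how the result degrades without that context.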