The goal of this paper is to investigate various language model smoothing techniques and decision-tree-based language model design algorithms. For this purpose, we build language models for printable characters (letters), based on the Brown corpus. We consider two classes of models for the text generation process: the n-gram language model and various decision-tree-based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n.
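For reference, standard textbook formulations of the two smoothing families compared here are sketched below; the paper's exact variants (in particular the bottom-up estimation of the interpolation weights) may differ. The weight $\lambda$, discount $d$, and back-off normalizer $\alpha$ are generic placeholders, not values taken from the paper.

```latex
% Deleted interpolation: mix the maximum-likelihood n-gram estimate
% with the recursively smoothed (n-1)-gram model; the weights \lambda
% are typically estimated on held-out data.
P_{\mathrm{DI}}(w_i \mid w_{i-n+1}^{i-1})
  = \lambda \, P_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1})
  + (1 - \lambda) \, P_{\mathrm{DI}}(w_i \mid w_{i-n+2}^{i-1})

% Back-off: use a discounted estimate when the n-gram was observed,
% otherwise fall back to the (n-1)-gram model, renormalized by \alpha.
P_{\mathrm{BO}}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
  d(w_{i-n+1}^{i}) \, P_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1})
    & \text{if } c(w_{i-n+1}^{i}) > 0, \\[2pt]
  \alpha(w_{i-n+1}^{i-1}) \, P_{\mathrm{BO}}(w_i \mid w_{i-n+2}^{i-1})
    & \text{otherwise.}
\end{cases}
```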
In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering-type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling.
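To illustrate what a K-means clustering-type question-design step could look like, here is a minimal Python sketch under stated assumptions: each candidate context letter is represented by its empirical next-letter distribution, and a 2-means split of the alphabet yields one binary question of the form "is the context letter in set S?". The function name, the Euclidean distance, and the add-one smoothing are illustrative choices, not the paper's algorithm, which may use a likelihood- or divergence-based criterion.

```python
import numpy as np

def design_binary_question(contexts, alphabet, n_iter=20, seed=0):
    """Split `alphabet` into two sets via 2-means on next-letter
    distributions, yielding one binary decision-tree question.

    contexts: list of (context_letter, next_letter) pairs observed
    in training text.
    """
    idx = {a: i for i, a in enumerate(alphabet)}
    # Count next-letter occurrences for each context letter.
    counts = np.zeros((len(alphabet), len(alphabet)))
    for ctx, nxt in contexts:
        counts[idx[ctx], idx[nxt]] += 1
    # Normalize rows to next-letter distributions (add-one smoothing
    # keeps rows for unseen letters well-defined).
    dists = (counts + 1.0) / (counts + 1.0).sum(axis=1, keepdims=True)

    rng = np.random.default_rng(seed)
    centers = dists[rng.choice(len(alphabet), size=2, replace=False)]
    for _ in range(n_iter):
        # Assign each letter to the nearest center (Euclidean here;
        # a divergence between distributions would be the more
        # natural distance for this task).
        d = ((dists[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Recompute each cluster center as the mean of its members.
        for k in range(2):
            if (assign == k).any():
                centers[k] = dists[assign == k].mean(axis=0)
    # The question: "does the context letter belong to cluster 0?"
    return {a for a, k in zip(alphabet, assign) if k == 0}

# Usage: pairs of adjacent letters drawn from training text, e.g.
# contexts = list(zip(text, text[1:]))
```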
We believe that this is due to the predictive nature of letter strings, which n-grams seem to model naturally. © 1998 Elsevier Science B.V. All rights reserved.