In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words.
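The core idea can be sketched in a few lines: when a bigram (w1, w2) is unseen, average the conditional probabilities of w2 after words similar to w1, weighted by their similarity to w1. The toy counts, neighbor words, and similarity weights below are illustrative assumptions, not the paper's actual data or exact formula.

```python
# Toy counts standing in for a training corpus (illustrative only).
bigram = {("devour", "peach"): 3, ("devour", "fruit"): 2, ("consume", "peach"): 1}
unigram = {"devour": 5, "consume": 4, "eat": 0}

def cond_prob(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1) from the toy counts."""
    return bigram.get((w1, w2), 0) / unigram[w1] if unigram[w1] else 0.0

def similarity_smoothed_prob(w1, w2, neighbors):
    """Estimate P(w2 | w1) for an unseen bigram as a similarity-weighted
    average over words similar to w1; neighbors is a list of
    (word, similarity_weight) pairs, assumed given here."""
    total = sum(s for _, s in neighbors)
    return sum(s * cond_prob(w2, w) for w, s in neighbors) / total

# "eat" never occurs in the toy corpus, so P(peach | eat) is estimated
# from its assumed nearest neighbors "devour" and "consume".
p = similarity_smoothed_prob("eat", "peach", [("devour", 0.7), ("consume", 0.3)])
```

Here `p` is 0.7 * (3/5) + 0.3 * (1/4) = 0.495, even though "eat peach" itself has zero corpus frequency.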
We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
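A back-off scheme of this flavor can be sketched as follows: use the observed bigram probability when available, otherwise back off to a discounted similarity-based estimate, and finally to the unigram probability. The stand-in estimators and the discount `alpha` are illustrative assumptions, not the paper's exact discounting scheme.

```python
# Toy estimators (illustrative stand-ins for corpus-derived models).
seen = {("a", "peach"): 0.2}

def cond_prob(w2, w1):
    """Observed bigram probability P(w2 | w1), zero if unseen."""
    return seen.get((w1, w2), 0.0)

def sim_prob(w2, w1):
    """Assumed similarity-based estimate for unseen pairs."""
    return 0.1 if (w1, w2) == ("a", "beach") else 0.0

def unigram_prob(w2):
    """Flat unigram stand-in used as the last resort."""
    return 0.01

def backoff_prob(w1, w2, alpha=0.4):
    """Back off from the observed bigram probability to a discounted
    similarity-based estimate, then to the unigram probability."""
    p = cond_prob(w2, w1)
    if p > 0:
        return p
    p_sim = sim_prob(w2, w1)
    return alpha * p_sim if p_sim > 0 else alpha * unigram_prob(w2)
```

With these toy numbers, `backoff_prob("a", "peach")` returns the observed 0.2, while the unseen `backoff_prob("a", "beach")` falls back to the discounted similarity estimate 0.04.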
We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.