Similarity-based models of word cooccurrence probabilities

Citation
I. Dagan et al., Similarity-based models of word cooccurrence probabilities, MACH LEARN, 34(1-3), 1999, pp. 43-69
Number of citations
56
Subject Categories
AI Robotics and Automatic Control
Journal title
MACHINE LEARNING
ISSN journal
0885-6125
Volume
34
Issue
1-3
Year of publication
1999
Pages
43 - 69
Database
ISI
SICI code
0885-6125(199902)34:1-3<43:SMOWCP>2.0.ZU;2-F
Abstract
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words.

We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.

We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
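The core idea of the abstract, estimating the probability of an unseen bigram from the conditional probabilities of the words most similar to its first word, can be sketched as below. This is a minimal illustration only, not the paper's exact model: the toy counts, the word "consume", and its similarity weights are all invented for demonstration, and the real models derive similarities from distributional measures over a large corpus.

```python
from collections import Counter

# Toy bigram counts (invented data for illustration).
bigram_counts = Counter({
    ("eat", "peach"): 3,
    ("eat", "apple"): 5,
    ("devour", "peach"): 2,
    ("devour", "apple"): 4,
})
unigram_counts = Counter()
for (w1, _), c in bigram_counts.items():
    unigram_counts[w1] += c

def mle(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from the counts."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# Hypothetical similarity weights for the words most similar to "consume".
most_similar = {"consume": {"eat": 0.6, "devour": 0.4}}

def similarity_estimate(w1, w2):
    """Estimate P(w2 | w1) for an unseen bigram as a similarity-weighted
    average of P(w2 | w1') over the words w1' most similar to w1."""
    neighbors = most_similar.get(w1, {})
    norm = sum(neighbors.values())
    if norm == 0:
        return 0.0
    return sum(sim / norm * mle(n, w2) for n, sim in neighbors.items())

# The bigram ("consume", "peach") never occurs in the toy counts,
# yet it still receives a nonzero probability estimate.
p = similarity_estimate("consume", "peach")
```

In a full back-off language model, an estimate like this would replace the uniform redistribution of discounted probability mass for bigrams that are absent from the training corpus.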