In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words.
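The core idea can be sketched in a few lines: when a bigram (w1, w2) is unseen, average the conditional probabilities of w2 after words similar to w1, weighted by their similarity to w1. The toy counts, neighbor words, and similarity weights below are illustrative assumptions, not the paper's actual data or exact formula.

```python
# Toy counts standing in for a training corpus (illustrative only).
bigram = {("devour", "peach"): 3, ("devour", "fruit"): 2, ("consume", "peach"): 1}
unigram = {"devour": 5, "consume": 4, "eat": 0}

def cond_prob(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1) from the toy counts."""
    return bigram.get((w1, w2), 0) / unigram[w1] if unigram[w1] else 0.0

def similarity_smoothed_prob(w1, w2, neighbors):
    """Estimate P(w2 | w1) for an unseen bigram as a similarity-weighted
    average over words similar to w1; neighbors is a list of
    (word, similarity_weight) pairs, assumed given here."""
    total = sum(s for _, s in neighbors)
    return sum(s * cond_prob(w2, w) for w, s in neighbors) / total

# "eat" never occurs in the toy corpus, so P(peach | eat) is estimated
# from its assumed nearest neighbors "devour" and "consume".
p = similarity_smoothed_prob("eat", "peach", [("devour", 0.7), ("consume", 0.3)])
```

Here `p` is 0.7 * (3/5) + 0.3 * (1/4) = 0.495, even though "eat peach" itself has zero corpus frequency.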
We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
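A back-off scheme of this flavor can be sketched as follows: use the observed bigram probability when available, otherwise back off to a discounted similarity-based estimate, and finally to the unigram probability. The stand-in estimators and the discount `alpha` are illustrative assumptions, not the paper's exact discounting scheme.

```python
# Toy estimators (illustrative stand-ins for corpus-derived models).
seen = {("a", "peach"): 0.2}

def cond_prob(w2, w1):
    """Observed bigram probability P(w2 | w1), zero if unseen."""
    return seen.get((w1, w2), 0.0)

def sim_prob(w2, w1):
    """Assumed similarity-based estimate for unseen pairs."""
    return 0.1 if (w1, w2) == ("a", "beach") else 0.0

def unigram_prob(w2):
    """Flat unigram stand-in used as the last resort."""
    return 0.01

def backoff_prob(w1, w2, alpha=0.4):
    """Back off from the observed bigram probability to a discounted
    similarity-based estimate, then to the unigram probability."""
    p = cond_prob(w2, w1)
    if p > 0:
        return p
    p_sim = sim_prob(w2, w1)
    return alpha * p_sim if p_sim > 0 else alpha * unigram_prob(w2)
```

With these toy numbers, `backoff_prob("a", "peach")` returns the observed 0.2, while the unseen `backoff_prob("a", "beach")` falls back to the discounted similarity estimate 0.04.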
We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.