A key element in modern text retrieval systems is the weighting of
individual words for importance. Early in the development of document
retrieval methods, it was recognized that performance could be improved if
weights were based at least in part on the frequencies of individual terms
in the database. This observation led investigators to propose inverse
document frequency (IDF) weighting, which has become the most commonly used
approach. IDF weighting can be given some justification based on
probabilistic arguments. However, many different formulas have been tried,
and it is difficult to distinguish between them on a purely theoretical
basis. Witten, Moffat and Bell have proposed a monotonicity condition as
fundamental: 'a term that appears in many documents should not be regarded
as more important than a term that appears in a few'. Based on this
monotonicity assumption and probabilistic arguments, we show here how the
TREC data can be used to learn ideal global weights. Using cross-validation,
we show that these weights give a modest but statistically significant
improvement over IDF weights. One conclusion is that IDF weights are close
to optimal within the probabilistic assumptions that are commonly made.
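
As a minimal illustration of the quantities discussed above, the following sketch computes IDF weights for a toy collection and checks the monotonicity condition. It assumes the common textbook form log(N / df), which is only one of the many formulas in use and is not necessarily the one evaluated here; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): count each term once per document
    return df


def idf_weights(docs):
    """Assumed textbook IDF: log(N / df). Other variants exist."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def satisfies_monotonicity(weights, df):
    """True if no term gets a higher weight than a term found in fewer documents."""
    terms = sorted(df, key=df.get)  # ascending document frequency
    return all(weights[a] >= weights[b] for a, b in zip(terms, terms[1:]))


# Toy collection of pre-tokenized documents (illustrative only).
docs = [
    ["inverse", "document", "frequency"],
    ["document", "retrieval"],
    ["document", "weights", "frequency"],
]
df = document_frequencies(docs)
weights = idf_weights(docs)
```

In this toy collection, "document" occurs in every document and so receives weight zero, while terms occurring in a single document receive the largest weight; any weighting that is a non-increasing function of document frequency passes the same check.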