Global term weights for document retrieval learned from TREC data

Authors
W.J. Wilbur
Citation
W.J. Wilbur, Global term weights for document retrieval learned from TREC data, J INF SCI, 27(5), 2001, pp. 303-310
Number of citations
24
Subject Categories
Library & Information Science
Journal title
JOURNAL OF INFORMATION SCIENCE
Journal ISSN
0165-5515
Volume
27
Issue
5
Year of publication
2001
Pages
303 - 310
Database
ISI
SICI code
0165-5515(2001)27:5<303:GTWFDR>2.0.ZU;2-6
Abstract
A key element in modern text retrieval systems is the weighting of individual words for importance. Early in the development of document retrieval methods it was recognized that performance could be improved if weights were based at least in part on the frequencies of individual terms in the database. This observation led investigators to propose inverse document frequency weighting, which has become the most commonly used approach. Inverse document frequency weighting can be given some justification based on probabilistic arguments. However, many different formulas have been tried and it is difficult to distinguish between these on a purely theoretical basis. Witten, Moffat and Bell have proposed a monotonicity condition as fundamental: 'a term that appears in many documents should not be regarded as more important than a term that appears in a few'. Based on this monotonicity assumption and probabilistic arguments we show here how the TREC data can be used to learn ideal global weights. Using cross-validation we show that these weights are a modest but statistically significant improvement over IDF weights. One conclusion is that IDF weights are close to optimal within the probabilistic assumptions that are commonly made.
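
To make the two ideas in the abstract concrete, the following minimal Python sketch computes inverse document frequency weights and tests the monotonicity condition quoted from Witten, Moffat and Bell. It is an illustration only, not the paper's learning procedure: the formula log(N / df) is just one of the many IDF variants the abstract alludes to, and the function names (`document_frequencies`, `idf_weights`, `is_monotone`) and the toy collection are assumptions introduced here.

```python
import math
from collections import Counter


def document_frequencies(documents):
    """Count, for each term, the number of documents that contain it."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # each term counts once per document
    return doc_freq


def idf_weights(documents):
    """Assumed IDF variant: weight(term) = log(N / df(term))."""
    n_docs = len(documents)
    doc_freq = document_frequencies(documents)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def is_monotone(weights, doc_freq):
    """Monotonicity condition: a term found in more documents must not
    receive a larger weight than a term found in fewer documents."""
    return all(
        weights[a] >= weights[b]
        for a in weights
        for b in weights
        if doc_freq[a] < doc_freq[b]
    )


# Hypothetical toy collection of tokenized documents.
docs = [
    ["term", "weight"],
    ["term", "retrieval"],
    ["term", "weight", "trec"],
]
weights = idf_weights(docs)
print(is_monotone(weights, document_frequencies(docs)))
```

Any weighting that is a decreasing function of document frequency passes this check, which is why the condition alone cannot single out one formula; the paper's contribution, per the abstract, is to learn the weights from TREC data subject to that constraint.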