Global term weights for document retrieval learned from TREC data

Authors

Wilbur, WJ

Citation

W.J. Wilbur, Global term weights for document retrieval learned from TREC data, J INF SCI, 27(5), 2001, pp. 303-310

Subject categories

Library & Information Science

Journal title

JOURNAL OF INFORMATION SCIENCE

ISSN journal

0165-5515

Volume

27

Issue

5

Year of publication

2001

Pages

303 - 310

Database

ISI

SICI code

0165-5515(2001)27:5<303:GTWFDR>2.0.ZU;2-6

Abstract

A key element in modern text retrieval systems is the weighting of individual words for importance. Early in the development of document retrieval methods it was recognized that performance could be improved if weights were based at least in part on the frequencies of individual terms in the database. This observation led investigators to propose inverse document frequency (IDF) weighting, which has become the most commonly used approach. Inverse document frequency weighting can be given some justification based on probabilistic arguments. However, many different formulas have been tried and it is difficult to distinguish between these on a purely theoretical basis. Witten, Moffat and Bell have proposed a monotonicity condition as fundamental: 'a term that appears in many documents should not be regarded as more important than a term that appears in a few'. Based on this monotonicity assumption and probabilistic arguments we show here how the TREC data can be used to learn ideal global weights. Using cross-validation we show that these weights are a modest but statistically significant improvement over IDF weights. One conclusion is that IDF weights are close to optimal within the probabilistic assumptions that are commonly made.
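
The two ideas named in the abstract, inverse document frequency weighting and the monotonicity condition, can be illustrated with a minimal Python sketch. It assumes the common textbook form log(N / df) for the IDF weight; the abstract does not state which formula the paper uses, and this sketch does not reproduce the weights learned from TREC data. The function names and the toy documents are hypothetical.

```python
import math


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return df


def idf_weights(docs):
    """Assumed textbook IDF: weight(term) = log(N / df(term))."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def is_monotone(weights, df):
    """Monotonicity condition: a term that appears in more documents
    never receives a larger weight than a term that appears in fewer."""
    ordered = sorted(df, key=lambda term: df[term])  # rarest terms first
    return all(weights[a] >= weights[b] for a, b in zip(ordered, ordered[1:]))


# Toy collection of three tokenized documents (illustrative only).
docs = [["term", "weight"], ["term", "retrieval"], ["term"]]
weights = idf_weights(docs)
df = document_frequencies(docs)
print(weights)
print(is_monotone(weights, df))
```

Under this assumed formula, a term that occurs in every document receives weight zero and rarer terms receive larger weights, so the monotonicity check passes by construction. Any other global weighting scheme can be passed to the same check.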