Test collections have traditionally been used by information retrieval researchers to improve their retrieval strategies. To be viable as a laboratory tool, a collection must reliably rank different retrieval variants according to their true effectiveness. In particular, the relative effectiveness of two retrieval strategies should be insensitive to modest changes in the relevant document set, since individual relevance assessments are known to vary widely.
The test collections developed in the TREC workshops have become the collections of choice in the retrieval research community. To verify their reliability, NIST investigated the effect that changes in the relevance assessments have on the evaluation of retrieval results. Very high correlations were found among the rankings of systems produced using different relevance judgment sets. The high correlations indicate that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools. Published by Elsevier Science Ltd.
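The correlation comparison described above can be sketched in code. The snippet below computes Kendall's tau, a standard rank-correlation measure often used for comparing system rankings, between two hypothetical rankings of the same systems under different judgment sets; the system names, ranks, and the choice of tau are illustrative assumptions, not data or methodology reproduced from the study.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same set of systems.

    rank_a, rank_b: dicts mapping system name -> rank position (1 = best).
    Returns a value in [-1, 1]; 1 means identical orderings.
    """
    systems = list(rank_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        # A pair is concordant if both rankings order s1 and s2 the same way.
        da = rank_a[s1] - rank_a[s2]
        db = rank_b[s1] - rank_b[s2]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical system rankings under two different relevance judgment sets.
official = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
alternate = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(kendall_tau(official, alternate))  # one swapped pair out of six -> 0.666...
```

A tau close to 1, as in this example where only one adjacent pair of systems swaps order, is the kind of outcome the abstract describes: the comparative ranking of systems is largely preserved even when the relevance judgments change.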