Predicting the effectiveness of naive data fusion on the basis of system characteristics

Authors
Citation
K.B. Ng and P.B. Kantor, Predicting the effectiveness of naive data fusion on the basis of system characteristics, J AM S INFO, 51(13), 2000, pp. 1177-1189
Number of citations
17
Subject categories
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
ISSN journal
0002-8231
Volume
51
Issue
13
Year of publication
2000
Pages
1177 - 1189
Database
ISI
SICI code
0002-8231(200011)51:13<1177:PTEOND>2.0.ZU;2-U
Abstract
Effective automation of the information retrieval task has long been an active area of research, leading to sophisticated retrieval models. With many IR schemes available, researchers have begun to investigate the benefits of combining the results of different IR schemes to improve performance, in the process called "data fusion." There are many successful data fusion experiments reported in the literature, but there are also cases in which it did not work well. Thus, it would be quite valuable to have a theory that can predict, in advance, whether fusion of two or more retrieval schemes will be worth doing. In a previous study (Ng & Kantor, 1998), we identified two predictive variables for the effectiveness of fusion: (a) a list-based measure of output dissimilarity, and (b) a pair-wise measure of the similarity of performance of the two schemes. In this article we investigate the predictive power of these two variables in simple symmetrical data fusion. We use the IR systems participating in the TREC 4 routing task to train a model that predicts the effectiveness of data fusion, and use the IR systems participating in the TREC 5 routing task to test that model. The model asks, "when will fusion perform better than an oracle who uses the best scheme from each pair?" We explore statistical techniques for fitting the model to the training data and use the receiver operating characteristic curve of signal detection theory to represent the power of the resulting models. The trained prediction methods predict whether fusion will beat an oracle, at levels much higher than could be achieved by chance.
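
The abstract does not say how the models were fitted or how the curves were computed, so the following is only a minimal Python sketch of the evaluation idea it names: given a predictor's score for each pair of schemes and a 0/1 label recording whether fusion beat the oracle, trace a receiver operating characteristic curve and summarize it by its area. The function names `roc_points` and `area_under_curve` are hypothetical; in practice the scores would come from whatever model is fitted to the two predictive variables.

```python
def roc_points(scores, labels):
    """Return ROC points (false-positive rate, true-positive rate).

    scores: predictor outputs, higher meaning "fusion is expected to win".
    labels: 1 if fusion actually beat the oracle for that pair, else 0.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sort by descending score; each distinct score is one threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every pair tied at this threshold before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under (x, y) points that are sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area of 0.5 corresponds to chance-level prediction and 1.0 to perfect separation, which is one way to read "much higher than could be achieved by chance."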