Kb. Ng and Pb. Kantor, Predicting the effectiveness of naive data fusion on the basis of system characteristics, J AM S INFO, 51(13), 2000, pp. 1177-1189
Number of citations
17
Subject Categories
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
Effective automation of the information retrieval task has long been an active
area of research, leading to sophisticated retrieval models. With many IR schemes
available, researchers have begun to investigate the benefits of combining the
results of different IR schemes to improve performance, in the process called
"data fusion." There are many successful data fusion experiments reported in the
literature, but there are also cases in which it did not work well. Thus, it
would be quite valuable to have a theory that can predict, in advance, whether
fusion of two or more retrieval schemes will be worth doing. In a previous study
(Ng & Kantor, 1998), we identified two predictive variables for the effectiveness
of fusion: (a) a list-based measure of output dissimilarity, and (b) a pair-wise
measure of the similarity of
performance of the two schemes. In this article we investigate the predictive
power of these two variables in simple symmetrical data fusion. We use the systems
participating in the TREC 4 routing task to train a model that predicts the
effectiveness of data fusion, and use the systems participating in the TREC 5
routing task to test that model. The model asks, "when will fusion perform better
than an oracle who uses the best scheme from each pair?" We explore statistical
techniques for fitting the model to the training data and use the receiver
operating characteristic curve of signal detection theory to represent the power
of the resulting models. The trained prediction methods predict whether fusion
will beat an oracle, at levels much higher than could be achieved by chance.
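
The abstract does not give the form of the prediction model or the data behind it, so the following is only a minimal sketch of how a receiver operating characteristic curve can summarize a predictor of this kind. It assumes each pair of schemes has a predicted score (higher meaning "fusion is more likely to beat the best single scheme") and an observed true/false outcome. The function names `roc_points` and `auc` are hypothetical, the example numbers are made up for illustration, and none of this is the authors' code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false-positive rate, true-positive rate)


def roc_points(scores: Sequence[float], labels: Sequence[bool]) -> List[Point]:
    """Return ROC points, one per distinct score threshold, starting at (0, 0)."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")

    # Sweep the threshold from the highest score down to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Point]) -> float:
    """Area under the ROC curve by the trapezoid rule; 0.5 is chance level."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up illustration: four scheme pairs, scores from some hypothetical model,
# and whether fusion actually beat the better scheme of each pair.
example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [True, True, False, False]
example_curve = roc_points(example_scores, example_labels)
example_area = auc(example_curve)
```

An area near 0.5 corresponds to prediction at chance level, and an area near 1.0 to a predictor that ranks every successful fusion pair above every unsuccessful one.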