IMPLEMENTATIONS OF PARTIAL DOCUMENT RANKING USING INVERTED FILES

Authors
Citation
Wyp. Wong and Dl. Lee, IMPLEMENTATIONS OF PARTIAL DOCUMENT RANKING USING INVERTED FILES, Information processing & management, 29(5), 1993, pp. 647-669
Citations number
23
Subject Categories
Information Science & Library Science; Computer Applications & Cybernetics
ISSN journal
03064573
Volume
29
Issue
5
Year of publication
1993
Pages
647 - 669
Database
ISI
SICI code
0306-4573(1993)29:5<647:IOPDRU>2.0.ZU;2-5
Abstract
Most commercial text retrieval systems employ inverted files to improve retrieval speed. This paper is concerned with the implementation of document ranking based on inverted files. Three heuristic methods for implementing the tf x idf weighting strategy, where tf stands for term frequency and idf stands for inverse document frequency, are studied. The basic idea of the heuristic methods is to process the query terms in an order such that as many top documents as possible can be identified without processing all of the query terms. The first heuristic was proposed by Smeaton and van Rijsbergen, and it serves as the basis for comparison with the other two heuristic methods, which are proposed in this paper. The three heuristics are evaluated and compared in experimental runs based on the number of disk accesses required for partial document ranking, in which the returned documents contain some, but not necessarily all, of the requested number of top documents. The results show that the proposed heuristic methods perform better than the method proposed by Smeaton and van Rijsbergen in terms of retrieval accuracy, which indicates the percentage of top documents obtained after a given number of disk accesses. For total document ranking, in which all of the requested number of top documents are guaranteed to be returned, no optimization technique studied so far leads to a substantial performance gain. To realize the advantage of the proposed heuristics, two methods for estimating the retrieval accuracy are studied. Their accuracies and processing costs are compared. All the experimental runs are based on four test collections made available with the SMART system.
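
The idea of partial ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce any of the three heuristics: the term ordering used here (highest idf first), the logarithmic idf formula, the treatment of one postings-list read as one disk access, and all names (build_index, rank, retrieval_accuracy, max_lists) are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build an inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def rank(index, n_docs, query_terms, k, max_lists=None):
    """Return the top-k doc ids by accumulated tf x idf score.

    max_lists caps how many postings lists are read (a stand-in for
    disk accesses, which is an assumption of this sketch). With
    max_lists=None every query term is processed (total ranking).
    """
    terms = [t for t in set(query_terms) if t in index]
    # Assumed idf formula: log(N / document frequency).
    idf = {t: math.log(n_docs / len(index[t])) for t in terms}
    # Assumed ordering: rarest terms (highest idf) are processed first.
    terms.sort(key=lambda t: (-idf[t], t))
    if max_lists is not None:
        terms = terms[:max_lists]

    scores = defaultdict(float)
    for t in terms:
        for doc_id, tf in index[t].items():
            scores[doc_id] += tf * idf[t]

    # Sort by descending score, breaking ties by doc id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:k]


def retrieval_accuracy(partial, full):
    """Fraction of the true top documents present in the partial result."""
    if not full:
        return 1.0
    return len(set(partial) & set(full)) / len(full)
```

Comparing rank(..., max_lists=m) against rank(...) with no cap gives the retrieval accuracy after m postings lists have been read, which mirrors how the abstract relates accuracy to the number of disk accesses.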