IMPLEMENTATIONS OF PARTIAL DOCUMENT RANKING USING INVERTED FILES

Authors

Wong, W.Y.P.; Lee, D.L.

Citation

W.Y.P. Wong and D.L. Lee, Implementations of partial document ranking using inverted files, Information Processing & Management, 29(5), 1993, pp. 647-669

Subject categories

Information Science & Library Science; Computer Applications & Cybernetics

Journal title

Information processing & management

ISSN journal

0306-4573

Volume

29

Issue

5

Year of publication

1993

Pages

647 - 669

Database

ISI

SICI code

0306-4573(1993)29:5<647:IOPDRU>2.0.ZU;2-5

Abstract

Most commercial text retrieval systems employ inverted files to improve retrieval speed. This paper is concerned with the implementation of document ranking based on inverted files. Three heuristic methods for implementing the tf x idf weighting strategy, where tf stands for term frequency and idf stands for inverse document frequency, are studied. The basic idea of the heuristic methods is to process the query terms in an order such that as many top documents as possible can be identified without processing all of the query terms. The first heuristic was proposed by Smeaton and van Rijsbergen, and it serves as the basis for comparison with the other two heuristic methods, which are proposed in this paper. The three heuristics are evaluated and compared in experimental runs based on the number of disk accesses required for partial document ranking, in which the returned documents contain some, but not necessarily all, of the requested number of top documents. The results show that the proposed heuristic methods perform better than the method proposed by Smeaton and van Rijsbergen in terms of retrieval accuracy, which indicates the percentage of top documents obtained after a given number of disk accesses. For total document ranking, in which all of the requested number of top documents are guaranteed to be returned, none of the optimization techniques studied so far leads to a substantial performance gain. To realize the advantage of the proposed heuristics, two methods for estimating the retrieval accuracy are studied, and their accuracies and processing costs are compared. All the experimental runs are based on four test collections made available with the SMART system.
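
The idea of partial ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce any of the three heuristics from the paper. It assumes, purely for illustration, that query terms are processed in decreasing idf order, that idf is computed as log(N / df), and that reading one postings list stands in for one disk access. The names build_inverted_file, partial_rank, retrieval_accuracy and the max_lists parameter are hypothetical.

```python
import math
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def partial_rank(index, n_docs, query_terms, top_k, max_lists):
    """Accumulate tf x idf scores while reading at most max_lists postings lists.

    Assumptions (not taken from the paper): idf = log(N / df), and terms are
    processed rarest first, with ties broken alphabetically.
    """
    idf = {
        t: math.log(n_docs / len(index[t]))
        for t in set(query_terms)
        if t in index
    }
    order = sorted(idf, key=lambda t: (-idf[t], t))
    scores = defaultdict(float)
    for term in order[:max_lists]:  # each list read models one disk access
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf * idf[term]
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_k]


def retrieval_accuracy(partial, full):
    """Fraction of the true top documents that appear in the partial result."""
    return len(set(partial) & set(full)) / len(full) if full else 1.0


# Toy collection (made up for this example).
docs = {
    "d1": "inverted file ranking ranking",
    "d2": "inverted file index",
    "d3": "document ranking",
    "d4": "inverted file",
}
index = build_inverted_file(docs)
query = ["ranking", "inverted"]

full = partial_rank(index, len(docs), query, top_k=2, max_lists=len(query))
partial = partial_rank(index, len(docs), query, top_k=2, max_lists=1)
accuracy = retrieval_accuracy(partial, full)
```

In this toy case, reading only the postings list of the rarest query term already yields the same top two documents as processing the whole query, which is the kind of saving partial ranking aims for. In general the partial result can miss some top documents, and retrieval_accuracy measures how many were recovered.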