ITA
ENG

Block addressing indices for approximate text retrieval

Authors

Baeza-Yates, R Navarro, G

Citation

R. Baeza-yates et G. Navarro, Block addressing indices for approximate text retrieval, J AM S INFO, 51(1), 2000, pp. 69-82

Citations number

Categorie Soggetti

Library & Information Science

Journal title

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE

ISSN journal

00028231 → ACNP

Volume

Issue

Year of publication

2000

Pages

69 - 82

Database

ISI

SICI code

0002-8231(200001)51:1<69:BAIFAT>2.0.ZU;2-H

Abstract

The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and ge t more heterogeneous and error prone, is that of flexible string matching. One of the best tools to make the search more flexible is to allow a limite d number of differences between the words found and those sought. This is c alled "approximate text searching," which is becoming more and more popular . In recent years some indexing schemes with very low space overhead have a ppeared, some of them dealing with approximate searching. These low overhea d indices (whose most notorious exponent is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to tex t blocks instead of exact word positions. Despite their existence, little i s known about the expected behavior of these "block addressing" indices, an d even less is known when it comes to cope with approximate search. Our mai n contribution is an analytical study of the space-time trade-offs for inde xed text searching. We study the space overhead and retrieval times as func tions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overh ead and in query time. This surprising analytical conclusion is validated w ith extensive experiments, obtaining typical performance figures. These res ults are valid for classical exact queries as well as for approximate searc hing. We apply our analysis to the Web, using recent statistics on the dist ribution of the document sizes. We show that pointing to documents instead of to fixed size blocks reduces space requirements but increases search tim es.