Block addressing indices for approximate text retrieval

Citation
R. Baeza-yates et G. Navarro, Block addressing indices for approximate text retrieval, J AM S INFO, 51(1), 2000, pp. 69-82
Citations number
37
Categorie Soggetti
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
ISSN journal
00028231 → ACNP
Volume
51
Issue
1
Year of publication
2000
Pages
69 - 82
Database
ISI
SICI code
0002-8231(200001)51:1<69:BAIFAT>2.0.ZU;2-H
Abstract
The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and ge t more heterogeneous and error prone, is that of flexible string matching. One of the best tools to make the search more flexible is to allow a limite d number of differences between the words found and those sought. This is c alled "approximate text searching," which is becoming more and more popular . In recent years some indexing schemes with very low space overhead have a ppeared, some of them dealing with approximate searching. These low overhea d indices (whose most notorious exponent is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to tex t blocks instead of exact word positions. Despite their existence, little i s known about the expected behavior of these "block addressing" indices, an d even less is known when it comes to cope with approximate search. Our mai n contribution is an analytical study of the space-time trade-offs for inde xed text searching. We study the space overhead and retrieval times as func tions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overh ead and in query time. This surprising analytical conclusion is validated w ith extensive experiments, obtaining typical performance figures. These res ults are valid for classical exact queries as well as for approximate searc hing. We apply our analysis to the Web, using recent statistics on the dist ribution of the document sizes. We show that pointing to documents instead of to fixed size blocks reduces space requirements but increases search tim es.