The issue of reducing the space overhead when indexing large text databases
is becoming more and more important, as the text collections grow in size.
Another subject, which is gaining importance as text databases grow and ge
t more heterogeneous and error prone, is that of flexible string matching.
One of the best tools to make the search more flexible is to allow a limite
d number of differences between the words found and those sought. This is c
alled "approximate text searching," which is becoming more and more popular
. In recent years some indexing schemes with very low space overhead have a
ppeared, some of them dealing with approximate searching. These low overhea
d indices (whose most notorious exponent is Glimpse) are modified inverted
files, where space is saved by making the lists of occurrences point to tex
t blocks instead of exact word positions. Despite their existence, little i
s known about the expected behavior of these "block addressing" indices, an
d even less is known when it comes to cope with approximate search. Our mai
n contribution is an analytical study of the space-time trade-offs for inde
xed text searching. We study the space overhead and retrieval times as func
tions of the block size. We find that, under reasonable assumptions, it is
possible to build an index which is simultaneously sublinear in space overh
ead and in query time. This surprising analytical conclusion is validated w
ith extensive experiments, obtaining typical performance figures. These res
ults are valid for classical exact queries as well as for approximate searc
hing. We apply our analysis to the Web, using recent statistics on the dist
ribution of the document sizes. We show that pointing to documents instead
of to fixed size blocks reduces space requirements but increases search tim
es.