U. Manber, A TEXT COMPRESSION SCHEME THAT ALLOWS FAST SEARCHING DIRECTLY IN THE COMPRESSED FILE, ACM transactions on information systems, 15(2), 1997, pp. 124-136
Citations number
22
Categorie Soggetti
Information Science & Library Science","Computer Science Information Systems
A new text compression scheme is presented in this article. The main p
urpose of this scheme is to speed up string matching by searching the
compressed file directly. The scheme requires no modification of the s
tring-matching algorithm, which is used as a black box; any string-mat
ching procedure can be used. Instead, the pattern is modified; only th
e outcome of the matching of the modified pattern against the compress
ed file is decompressed. Since the compressed file is smaller than the
original file, the search is faster both in terms of I/O time and pro
cessing time than a search in the original file. For typical text file
s, we achieve about 30% reduction of space and slightly less of search
time. A 30% space saving is not competitive with good text compressio
n schemes, and thus should not be used where space is the predominant
concern. The intended applications of this scheme are files that are s
earched often, such as catalogs, bibliographic files, and address book
s. Such files are typically not compressed, but with this scheme they
can remain compressed indefinitely, saving space while allowing faster
search at the same time. A particular application to an information r
etrieval system that we developed is also discussed.