A TEXT COMPRESSION SCHEME THAT ALLOWS FAST SEARCHING DIRECTLY IN THE COMPRESSED FILE

Authors
Citation
U. Manber, A TEXT COMPRESSION SCHEME THAT ALLOWS FAST SEARCHING DIRECTLY IN THE COMPRESSED FILE, ACM transactions on information systems, 15(2), 1997, pp. 124-136
Citations number
22
Categorie Soggetti
Information Science & Library Science","Computer Science Information Systems
ISSN journal
10468188
Volume
15
Issue
2
Year of publication
1997
Pages
124 - 136
Database
ISI
SICI code
1046-8188(1997)15:2<124:ATCSTA>2.0.ZU;2-O
Abstract
A new text compression scheme is presented in this article. The main p urpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the s tring-matching algorithm, which is used as a black box; any string-mat ching procedure can be used. Instead, the pattern is modified; only th e outcome of the matching of the modified pattern against the compress ed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and pro cessing time than a search in the original file. For typical text file s, we achieve about 30% reduction of space and slightly less of search time. A 30% space saving is not competitive with good text compressio n schemes, and thus should not be used where space is the predominant concern. The intended applications of this scheme are files that are s earched often, such as catalogs, bibliographic files, and address book s. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information r etrieval system that we developed is also discussed.