ITA
ENG

Fast and flexible word searching on compressed text

Authors

de Moura, ES Navarro, G Ziviani, N Baeza-Yates, R

Citation

Es. De Moura et al., Fast and flexible word searching on compressed text, ACM T INF S, 18(2), 2000, pp. 113-139

Citations number

Categorie Soggetti

Information Tecnology & Communication Systems

Journal title

ACM TRANSACTIONS ON INFORMATION SYSTEMS

ISSN journal

10468188 → ACNP

Volume

Issue

Year of publication

2000

Pages

113 - 139

Database

ISI

SICI code

1046-8188(200004)18:2<113:FAFWSO>2.0.ZU;2-S

Abstract

We present a fast compression and decompression technique for natural langu age texts. The novelties are that (1) decompression of arbitrary portions o f the text can be done very efficiently, (2) exact search for words and phr ases can be done on the compressed text directly, using any known sequentia l pattern-matching algorithm, and (3) word-based approximate and extended s earch can also be done efficiently without any decoding. The compression sc heme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical En glish texts to about 30% of their original size, against 40% and 35% for Co mpress and Gzip, respectively. Compression time is close to that of Compres s and approximately half the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algor ithms to search the compressed text;. They allow a large number of variatio ns over the basic word and phrase search capability, such as sets of charac ters, arbitrary regular expressions, and approximate matching. Separators a nd stopwords can be discarded at search time without significantly increasi ng the cost. When searching for simple words, the experiments show that run ning our algorithms on a compressed text is twice as fast as running the be st existing software on the uncompressed version of the same text. When sea rching complex or approximate patterns, our algorithms are up, to 8 times f aster than the search on uncompressed text. We also discuss the impact of o ur technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing onl y for displaying purposes.