ITA
ENG

STRING-MATCHING IN LEMPEL-ZIV COMPRESSED STRINGS

Authors

FARACH M THORUP M

Citation

M. Farach et M. Thorup, STRING-MATCHING IN LEMPEL-ZIV COMPRESSED STRINGS, Algorithmica, 20(4), 1998, pp. 388-404

Citations number

Categorie Soggetti

Mathematics,"Computer Science Software Graphycs Programming",Mathematics,"Computer Science Software Graphycs Programming

Journal title

Algorithmica → ACNP

ISSN journal

01784617

Volume

Issue

Year of publication

1998

Pages

388 - 404

Database

ISI

SICI code

0178-4617(1998)20:4<388:SILCS>2.0.ZU;2-V

Abstract

String matching and compression are two widely studied arena of comput er science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be us ed to derive fast implementations of many important compression scheme s, most notably the Lempel-Ziv (LZ77) algorithm. intuitively, once a s tring has been compressed-and therefore its repetitive nature has been elucidated--one might be tempted to exploit this knowledge to speed u p string matching. The Compressed Matching Problem is that of performi ng string matching in a compressed text, without uncompressing it. Mor e formally, let T be a text, let Z be the compressed string representi ng T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW . In this paper we give the first nontrivial compressed matching algor ithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other d ictionary compression schemes, such as LZ78 and LZW, though for string s with constant per bit entropy, all these schemes compress optimally in the limit. However, for strings with o(1) per bit entropy, while it was recently shown that the LZ77 gives compression to within a consta nt factor of optimal, schemes such as LZ78 and LZW may deviate from op timality by an exponential factor. Asymptotically, compressed matching is only relevant if \Z\ = o(\T\), i.e., if the compression ratio \T\/ \Z\ is more than a constant. These results show that LZ77 is the appro priate compression method in such settings. We present an LZ77 compres sed matching algorithm which runs in time O(N log(2) U/N+P) where N = \Z\, U = \T\, and P = \P\. Compare with the naive ''decompresion'' alg orithm, which takes time Theta(U + P) to decide of P occurs in T. Writ ing U + P as N . U/N + P, we see that we have improved the complexity, replacing the compression factor U/N by a factor log(2) U/N. Our algo rithm is competitive in the sense that O(N log(2) U/N + P) = O(U + P), and opportunistic in the sense that O(N log(2) U/N + P) = o(U + P) if N = o(U) and P = o(U).