Tc. Bell et Ih. Witten, THE RELATIONSHIP BETWEEN GREEDY PARSING AND SYMBOLWISE TEXT COMPRESSION, Journal of the Association for Computing Machinery, 41(4), 1994, pp. 708-724
Text compression method can be divided into two classes: symbolwise an
d parsing. Symbolwise method assign codes to individual symbols, while
parsing methods assign codes to groups of consecutive symbols (phrase
s). The set of phrases available to a parsing method is referred to as
a dictionary. The vast majority of parsing methods in the literature
use greedy parsing (including nearly all variations of the popular Ziv
-Lempel methods). When greedy parsing is used, the coder processes a s
tring from left to right, at each step encoding as many symbols as pos
sible with a phrase from the dictionary. This parsing strategy is not
optimal, but an optimal method cannot guarantee a bounded coding delay
. An important problem in compression research has been to establish t
he relationship between symbolwise methods and parsing methods. This p
aper extends prior work that shows that there are symbolwise methods t
hat simulate a subset of greedy parsing methods. We provide a more gen
eral algorithm that takes any nonadaptive greedy parsing method and co
nstructs a symbolwise method that achieves exactly the same compressio
n. Combined with the existence of symbolwise equivalents for two of th
e most significant adaptive parsing methods, this result gives added w
eight to the idea that research aimed at increasing compression should
concentrate on symbolwise methods, while parsing methods should be ch
osen for speed or temporary storage considerations.