Over the last decades, improvements in CPU speed have outpaced improvements
in main memory and disk access rates by orders of magnitude, enabling the
use of data compression techniques to improve the performance of database s
ystems. Previous work describes the benefits of compression for numerical a
ttributes, where data is stored in compressed format on disk. Despite the a
bundance of string-valued attributes in relational schemas there is little
work on compression for string attributes in a database context. Moreover,
none of the,previous work suitably addresses the role of the query optimize
r: During query execution,data is either eagerly decompressed when it is re
ad into main memory, or data lazily stays compressed in main memory and is
decompressed on demand only.
In this paper, we present in effective approach for database compression ba
sed on lightweight, attribute-level compression techniques. We propose a Hi
erarchical Dictionary Encoding strategy that intelligently selects the most
effective compression method for string-valued attributes. We show that ea
ger and lazy decompression strategies produce suboptimal plans for queries
involving compressed string attributes. We then formalize the problem of co
mpression-aware query optimization and propose one provably optimal and two
fast heuristic algorithms for selecting a query plan for relational schema
s with compressed attributes; our algorithms can easily be integrated into
existing cost-based query optimizers. Experiments using TPC-H data demonstr
ate the impact of our string compression methods and show the importance of
compression-aware query optimization. Our approach results in up to an ord
er speed up over existing approaches.