Global optimization of histograms

Citation
Hv. Jagadish et al., Global optimization of histograms, SIG RECORD, 30(2), 2001, pp. 223-234
Citations number
20
Categorie Soggetti
Computer Science & Engineering
Journal title
SIGMOD RECORD
ISSN journal
01635808 → ACNP
Volume
30
Issue
2
Year of publication
2001
Pages
223 - 234
Database
ISI
SICI code
0163-5808(200106)30:2<223:GOOH>2.0.ZU;2-U
Abstract
Histograms are frequently used to represent the distribution of data values in an attribute of a relation. Most previous work has focused on identifyi ng the optimal histogram (given a limited number of buckets) for a single a ttribute independent of other attributes/histograms. In this paper, we prop ose the idea of global optimization of histograms, i.e., single-attribute h istograms for a set of attributes are optimized collectively so as to minim ize the overall error in using the histograms. The idea is to allocate more buckets to histograms whose attributes are more frequently used and/or dis tributions are highly skewed. While the accuracy of some histograms is pena lized (being assigned fewer buckets), we expect the global error to be low compared to the traditional method (of allocating equal number of buckets t o each histogram). We propose two algorithms to determine the histograms to construct for a co llection of attributes. The first is based on dynamic programming, and the second is a greedy algorithm. We compare the overall error of these algorit hms against the traditional method. Extensive experiments are conducted and the results confirm the benefits of global optimal histograms in reducing the overall error. The extent of improvement depends on the data and query distributions, ranging from no benefit when there is no significant differe nces in the data distributions to over a factor,of 100 reduction in error i n some cases we tried. The time to compute global optimal histogram using dynamic programming is m uch longer than the time to compute optimal histograms separately for each attribute, and the difference widens at a faster rate as the number of hist ograms increases. With the greedy algorithm, the time penalty is small, but the error reduction is somewhat less as well. We propose a third algorithm , called greedy; algorithm with remedy, that has running time similar to th e greedy algorithm, but produces results close to global optimum. In fact, in every experiment that we tried, this algorithm found the exact global op timum.