ITA
ENG

Global optimization of histograms

Authors

Jagadish, HV Jin, H Ooi, BC Tan, KL

Citation

Hv. Jagadish et al., Global optimization of histograms, SIG RECORD, 30(2), 2001, pp. 223-234

Citations number

Categorie Soggetti

Computer Science & Engineering

Journal title

SIGMOD RECORD

ISSN journal

01635808 → ACNP

Volume

Issue

Year of publication

2001

Pages

223 - 234

Database

ISI

SICI code

0163-5808(200106)30:2<223:GOOH>2.0.ZU;2-U

Abstract

Histograms are frequently used to represent the distribution of data values in an attribute of a relation. Most previous work has focused on identifyi ng the optimal histogram (given a limited number of buckets) for a single a ttribute independent of other attributes/histograms. In this paper, we prop ose the idea of global optimization of histograms, i.e., single-attribute h istograms for a set of attributes are optimized collectively so as to minim ize the overall error in using the histograms. The idea is to allocate more buckets to histograms whose attributes are more frequently used and/or dis tributions are highly skewed. While the accuracy of some histograms is pena lized (being assigned fewer buckets), we expect the global error to be low compared to the traditional method (of allocating equal number of buckets t o each histogram). We propose two algorithms to determine the histograms to construct for a co llection of attributes. The first is based on dynamic programming, and the second is a greedy algorithm. We compare the overall error of these algorit hms against the traditional method. Extensive experiments are conducted and the results confirm the benefits of global optimal histograms in reducing the overall error. The extent of improvement depends on the data and query distributions, ranging from no benefit when there is no significant differe nces in the data distributions to over a factor,of 100 reduction in error i n some cases we tried. The time to compute global optimal histogram using dynamic programming is m uch longer than the time to compute optimal histograms separately for each attribute, and the difference widens at a faster rate as the number of hist ograms increases. With the greedy algorithm, the time penalty is small, but the error reduction is somewhat less as well. We propose a third algorithm , called greedy; algorithm with remedy, that has running time similar to th e greedy algorithm, but produces results close to global optimum. In fact, in every experiment that we tried, this algorithm found the exact global op timum.