Histograms are frequently used to represent the distribution of data values
in an attribute of a relation. Most previous work has focused on identifyi
ng the optimal histogram (given a limited number of buckets) for a single a
ttribute independent of other attributes/histograms. In this paper, we prop
ose the idea of global optimization of histograms, i.e., single-attribute h
istograms for a set of attributes are optimized collectively so as to minim
ize the overall error in using the histograms. The idea is to allocate more
buckets to histograms whose attributes are more frequently used and/or dis
tributions are highly skewed. While the accuracy of some histograms is pena
lized (being assigned fewer buckets), we expect the global error to be low
compared to the traditional method (of allocating equal number of buckets t
o each histogram).
We propose two algorithms to determine the histograms to construct for a co
llection of attributes. The first is based on dynamic programming, and the
second is a greedy algorithm. We compare the overall error of these algorit
hms against the traditional method. Extensive experiments are conducted and
the results confirm the benefits of global optimal histograms in reducing
the overall error. The extent of improvement depends on the data and query
distributions, ranging from no benefit when there is no significant differe
nces in the data distributions to over a factor,of 100 reduction in error i
n some cases we tried.
The time to compute global optimal histogram using dynamic programming is m
uch longer than the time to compute optimal histograms separately for each
attribute, and the difference widens at a faster rate as the number of hist
ograms increases. With the greedy algorithm, the time penalty is small, but
the error reduction is somewhat less as well. We propose a third algorithm
, called greedy; algorithm with remedy, that has running time similar to th
e greedy algorithm, but produces results close to global optimum. In fact,
in every experiment that we tried, this algorithm found the exact global op
timum.