A data cube is a popular organization for summary data. A cube is simply a
multidimensional structure that contains in each cell an aggregate value,
i.e., the result of applying an aggregate function to an underlying relation.
In practice, cubes can require a large amount of storage, so compressing
them is important. In this paper, we propose an
approximation technique that reduces the storage cost of the cube at the
price of returning approximate answers to queries posed against the cube.
The idea is to characterize regions of the cube by using statistical models
whose descriptions take less space than the data itself. The model
parameters can then be used to estimate the cube cells with a certain level
of accuracy. To increase the accuracy and to guarantee the level of error in the
query answers, some of the "outliers" (i.e., the cells that incur the largest
errors when estimated) are retained. The model parameters and the retained
cells should, of course, take only a fraction of the space of the full cube,
and the estimation procedure should be faster than computing the data from
the underlying relations. We use loglinear models to model the cube regions.
Experiments show that the errors introduced in typical queries are small
even when the description is substantially smaller than the full cube.
Since cubes are used to support data analysis and analysts
are rarely interested in the precise values of the aggregates (but rather
in trends), providing approximate answers is, in most cases, a satisfactory
compromise. Although other techniques have been used to compress data cubes,
ours has the advantage of combining parametric (loglinear) models with
retained outliers, which enables the system to give data-independent error
guarantees for every query posed on the data
cube. The models also offer information about the underlying structure of
the data they model. Moreover, these models are relatively easy to update
dynamically as data is added to the warehouse.
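
The idea of modeling a region and retaining outliers can be illustrated with a minimal sketch. It assumes a two-dimensional region of non-negative counts and uses the independence model, the simplest loglinear model, in which log m_ij = u + u_i + u_j and the fitted value is (row sum × column sum) / total. The model choice, the relative-error threshold, and the names `compress_region`, `estimate_cell`, and `max_rel_error` are illustrative assumptions; the abstract does not specify how regions are chosen, which loglinear models are fitted, or how outliers are selected.

```python
def independence_estimate(row_sums, col_sums, total, i, j):
    """Fitted value of the independence loglinear model for cell (i, j)."""
    if total == 0:
        return 0.0
    return row_sums[i] * col_sums[j] / total


def compress_region(region, max_rel_error):
    """Summarize a 2-D region by its marginals plus retained outlier cells.

    A cell is retained exactly when the model estimate misses it by more
    than max_rel_error times the true value (hypothetical criterion).
    """
    row_sums = [sum(row) for row in region]
    col_sums = [sum(col) for col in zip(*region)]
    total = sum(row_sums)
    outliers = {}
    for i, row in enumerate(region):
        for j, actual in enumerate(row):
            estimate = independence_estimate(row_sums, col_sums, total, i, j)
            if abs(estimate - actual) > max_rel_error * abs(actual):
                outliers[(i, j)] = actual
    return row_sums, col_sums, total, outliers


def estimate_cell(model, i, j):
    """Answer a cell query from the compressed description."""
    row_sums, col_sums, total, outliers = model
    if (i, j) in outliers:
        return outliers[(i, j)]  # retained cells are answered exactly
    return independence_estimate(row_sums, col_sums, total, i, j)


# Made-up region: nearly multiplicative, with cell (0, 0) perturbed.
region = [
    [16, 20, 40],
    [20, 40, 80],
    [30, 60, 120],
]
max_rel_error = 0.10
model = compress_region(region, max_rel_error)

for i, row in enumerate(region):
    for j, actual in enumerate(row):
        print((i, j), actual, round(estimate_cell(model, i, j), 2))
```

For an n × m region this description stores n + m marginals plus the retained cells instead of n·m values, and every cell answered from it stays within the chosen relative error because any cell that would violate the bound is kept exactly.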