Multidimensional analysis and online analytical processing (OLAP) operations
require summary information on multidimensional data sets. The most common are
aggregate operations along one or more dimensions of numerical data values.
Simultaneous calculation of multidimensional aggregates is provided by
the Data Cube operator, used to calculate and store summary information on
a number of dimensions. The cube is computed only partially if the number of
dimensions is large. Query processing for these applications requires different
views of the data to gain insight and to support effective decision making.
Queries may be answered either from a materialized cube in the data cube or by
calculation on the fly.
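
To make the Data Cube operator concrete, the following is a minimal sketch that computes a SUM aggregate for every subset of the dimensions, so that later queries can be answered from the stored group-bys. It is a naive sequential illustration and not the parallel method described in this article; the function name `data_cube`, the column names, and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (all 2^n group-bys)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                # The key holds the row's values for the grouped dimensions only.
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube


# Hypothetical sample data: three dimensions and one numeric measure.
rows = [
    {"product": "pen", "region": "east", "year": 2000, "sales": 10},
    {"product": "pen", "region": "west", "year": 2000, "sales": 5},
    {"product": "ink", "region": "east", "year": 2001, "sales": 7},
]
cube = data_cube(rows, ("product", "region", "year"), "sales")

# A query on one dimension is answered from the materialized group-by.
sales_by_product = cube[("product",)]
```

With n dimensions this stores 2^n group-bys, which is why a partial computation becomes attractive when n is large.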
The multidimensionality of the underlying problem can be represented both in
relational and in multidimensional databases, the latter being a better fit
when query performance is the criterion for judgment. Relational databases are
scalable in size for OLAP and multidimensional analysis, and efforts are under
way to make their performance acceptable. On the other hand, multidimensional
databases have proven to provide good performance for such queries, although
they are not very scalable. In this article we address (1) scalability in
multidimensional systems for OLAP and multidimensional analysis and (2)
integration of data mining with the OLAP framework. We describe our system
PARSIMONY (parallel and scalable infrastructure for multidimensional online
analytical processing), used for both OLAP and data mining. Sparsity of data
sets is handled by using chunks to store data either as a dense block using
multidimensional arrays or as a sparse representation using a bit-encoded
sparse structure. Chunks provide a multidimensional index structure for
efficient dimension-oriented data accesses, much the same as multidimensional
arrays do. Operations within chunks and between chunks are a combination of
relational and multidimensional operations, depending on whether the chunk
is sparse or dense. Further, we develop parallel algorithms for data mining
on the multidimensional cube structure for attribute-oriented association
rules and decision-tree-based classification. These take advantage of the
data organization provided by the multidimensional data model. Performance
results for high-dimensional data sets on a distributed-memory parallel
machine (IBM SP-2) show good speedup and scalability. (C) 2001 Academic Press.
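
The chunk storage choice described above can be illustrated with a small sketch: a chunk is kept as a flat row-major array when it is dense, and as a sorted list of (bit-encoded index, value) pairs when it is sparse. The abstract does not specify the bit layout or the density cutoff used in PARSIMONY, so the field widths, the `dense_threshold` parameter, and all function names here are assumptions made for illustration only.

```python
def bits_needed(extent):
    """Bits required to encode any index in range(extent)."""
    return max(1, (extent - 1).bit_length())


def encode(index, shape):
    """Pack a per-dimension index tuple into one integer, one bit field per dimension."""
    code = 0
    for i, extent in zip(index, shape):
        code = (code << bits_needed(extent)) | i
    return code


def decode(code, shape):
    """Recover the per-dimension index tuple from a packed integer."""
    index = []
    for extent in reversed(shape):
        width = bits_needed(extent)
        index.append(code & ((1 << width) - 1))
        code >>= width
    return tuple(reversed(index))


def build_chunk(cells, shape, dense_threshold=0.5):
    """Store one chunk as dense or sparse; `cells` maps index tuples to values.

    dense_threshold is an assumed cutoff, not a value taken from the article.
    """
    size = 1
    for extent in shape:
        size *= extent
    if len(cells) / size >= dense_threshold:
        data = [0] * size
        for index, value in cells.items():
            offset = 0
            for i, extent in zip(index, shape):
                offset = offset * extent + i  # row-major offset within the chunk
            data[offset] = value
        return ("dense", data)
    pairs = sorted((encode(index, shape), value) for index, value in cells.items())
    return ("sparse", pairs)


sparse_chunk = build_chunk({(0, 0, 0): 1, (3, 1, 1): 2}, (4, 4, 2))
dense_chunk = build_chunk({(0, 0): 1, (0, 1): 2, (1, 1): 3}, (2, 2))
```

In this sketch a dense chunk supports direct array indexing, while a sparse chunk is scanned or searched like a small relation keyed on the encoded index, which mirrors the mix of multidimensional and relational operations mentioned in the abstract.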