PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining

Citation
S. Goil and A. Choudhary, PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining, J PAR DISTR, 61(3), 2001, pp. 285-321
Number of citations
32
Subject Categories
Computer Science & Engineering
Journal title
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
ISSN journal
0743-7315
Volume
61
Issue
3
Year of publication
2001
Pages
285 - 321
Database
ISI
SICI code
0743-7315(200103)61:3<285:PAIFPM>2.0.ZU;2-J
Abstract
Multidimensional analysis and online analytical processing (OLAP) operations require summary information on multidimensional data sets. Most common are aggregate operations along one or more dimensions of numerical data values. Simultaneous calculation of multidimensional aggregates is provided by the Data Cube operator, used to calculate and store summary information on a number of dimensions. This is computed only partially if the number of dimensions is large. Query processing for these applications requires different views of data to gain insight and for effective decision support. Queries may either be answered from a materialized cube in the data cube or calculated on the fly.

The multidimensionality of the underlying problem can be represented both in relational and in multidimensional databases, the latter being a better fit when query performance is the criterion for judgment. Relational databases are scalable in size for OLAP and multidimensional analysis, and efforts are under way to make their performance acceptable. On the other hand, multidimensional databases have proven to provide good performance for such queries, although they are not very scalable. In this article we address (1) scalability in multidimensional systems for OLAP and multidimensional analysis and (2) integration of data mining with the OLAP framework. We describe our system PARSIMONY, a parallel and scalable infrastructure for multidimensional online analytical processing, used for both OLAP and data mining. Sparsity of data sets is handled by using chunks to store data either as a dense block using multidimensional arrays or as a sparse representation using a bit-encoded sparse structure. Chunks provide a multidimensional index structure for efficient dimension-oriented data accesses, much the same as multidimensional arrays do.
Operations within chunks and between chunks are a combination of relational and multidimensional operations depending on whether the chunk is sparse or dense. Further, we develop parallel algorithms for data mining on the multidimensional cube structure for attribute-oriented association rules and decision-tree-based classification. These take advantage of the data organization provided by the multidimensional data model. Performance results for high-dimensional data sets on a distributed memory parallel machine (IBM SP-2) show good speedup and scalability. (C) 2001 Academic Press.
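
Illustrative note: the Data Cube operator mentioned in the abstract computes an aggregate for every subset of the dimensions. The sketch below is a minimal, sequential illustration of that idea only. It is not the PARSIMONY implementation, which the abstract describes as parallel and chunk-based. The function name `data_cube`, the sample rows, and the choice of sum as the aggregate are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (hypothetical helper).

    Returns a dict mapping each group-by (a tuple of dimension names)
    to a dict of {key tuple: summed measure}.
    """
    cube = {}
    # Enumerate all 2^d group-bys, from the grand total () to the full set.
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube


# Made-up sample data with two dimensions and one numeric measure.
rows = [
    {"product": "pen", "region": "east", "sales": 3},
    {"product": "pen", "region": "west", "sales": 5},
    {"product": "ink", "region": "east", "sales": 2},
]

cube = data_cube(rows, ("product", "region"), "sales")
print(cube[()])            # grand total over all rows
print(cube[("product",)])  # totals per product
print(cube[("region",)])   # totals per region
```

With d dimensions this enumerates 2^d group-bys, which is one way to see why, as the abstract notes, the cube is computed only partially when the number of dimensions is large.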