iDiff: Informative summarization of differences in multidimensional aggregates

Authors
S. Sarawagi
Citation
S. Sarawagi, iDiff: Informative summarization of differences in multidimensional aggregates, DATA M K D, 5(4), 2001, pp. 255-276
Number of citations
18
Subject categories
AI Robotics and Automatic Control
Journal title
DATA MINING AND KNOWLEDGE DISCOVERY
Journal ISSN
1384-5810
Volume
5
Issue
4
Year of publication
2001
Pages
255 - 276
Database
ISI
SICI code
1384-5810(2001)5:4<255:IISODI>2.0.ZU;2-M
Abstract
Multidimensional OLAP products provide an excellent opportunity for integrating mining functionality because of their widespread acceptance as a decision support tool and their existing heavy reliance on manual, user-driven analysis. Most OLAP products are rather simplistic and rely heavily on the user's intuition to manually drive the discovery process. Such ad hoc user-driven exploration gets tedious and error-prone as data dimensionality and size increase. Our goal is to automate these manual discovery processes. In this paper we present an example of such automation through an iDiff operator that in a single step returns summarized reasons for drops or increases observed at an aggregated level. We formulate this as a problem of summarizing the difference between two multidimensional arrays of real numbers. We develop a general framework for such summarization and propose a specific formulation for the case of OLAP aggregates. We develop an information theoretic formulation for expressing the reasons that is compact and easy to interpret. We design an efficient dynamic programming algorithm that requires only one pass of the data and uses a small amount of memory independent of the data size. This allows easy integration with existing OLAP products. Our prototype has been tested on the Microsoft OLAP server, DB2/UDB and Oracle 8i. Experiments using the OLAP benchmark demonstrate (1) scalability of our algorithm as the size and dimensionality of the cube increase and (2) feasibility of getting interactive answers with modest hardware resources.
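
To make the problem statement in the abstract concrete, the following is a minimal Python sketch of the underlying task: given two multidimensional aggregates, find which values of a chosen dimension account for most of the observed change. It is a naive ranking by absolute change and is not the information-theoretic formulation or the dynamic programming algorithm of the paper. The function name `top_difference_rows`, the dictionary-based cube representation, and the sample figures are all assumptions made for illustration.

```python
from collections import defaultdict


def top_difference_rows(before, after, dims, group_by, k=3):
    """Rank values of the `group_by` dimensions by their share of the change.

    before, after: dicts mapping a tuple of dimension values to a number
                   (a hypothetical, simplified stand-in for an OLAP cube).
    dims:          tuple of dimension names, in the same order as the keys.
    group_by:      tuple of dimension names to aggregate the difference on.
    """
    positions = [dims.index(d) for d in group_by]
    delta = defaultdict(float)
    # Treat a cell missing from one cube as zero in that cube.
    for cell in set(before) | set(after):
        key = tuple(cell[i] for i in positions)
        delta[key] += after.get(cell, 0.0) - before.get(cell, 0.0)
    # Largest absolute change first; ties broken by key for determinism.
    ranked = sorted(delta.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:k]


# Illustrative data only: sales by (product, region) in two periods.
dims = ("product", "region")
before = {
    ("laptop", "east"): 100.0,
    ("laptop", "west"): 80.0,
    ("phone", "east"): 50.0,
    ("phone", "west"): 60.0,
}
after = {
    ("laptop", "east"): 40.0,
    ("laptop", "west"): 82.0,
    ("phone", "east"): 55.0,
    ("phone", "west"): 58.0,
}

summary = top_difference_rows(before, after, dims, ("product",), k=2)
for key, change in summary:
    print(key, change)
```

In this toy data the overall drop is concentrated in the laptop rows, which is the kind of explanation a user would otherwise have to find by drilling down manually, one dimension at a time.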