ITA
ENG

Data bubbles: Quality preserving performance boosting for hierarchical clustering

Authors

Breunig, MM Kriegel, HP Kroger, P Sander, J

Citation

Mm. Breunig et al., Data bubbles: Quality preserving performance boosting for hierarchical clustering, SIG RECORD, 30(2), 2001, pp. 79-90

Citations number

Categorie Soggetti

Computer Science & Engineering

Journal title

SIGMOD RECORD

ISSN journal

01635808 → ACNP

Volume

Issue

Year of publication

2001

Pages

79 - 90

Database

ISI

SICI code

0163-5808(200106)30:2<79:DBQPPB>2.0.ZU;2-R

Abstract

In this paper, we investigate how to scale hierarchical clustering methods (such as OPTICS) to extremely large databases by utilizing data compression methods (such as BIRCH or random sampling). We propose a three step proced ure: 1) compress the data into suitable representative objects; 2) apply th e hierarchical clustering algorithm only to these objects; 3) recover the c lustering structure for the whole data set, based on the result for the com pressed data. The key issue in this approach is to design compressed data i tems such that not only a hierarchical clustering algorithm can be applied, but also that they contain enough information to infer the clustering stru cture of the original data Set in the third step. This is crucial because t he results of hierarchical clustering algorithms, when applied naively to a random sample or to the clustering features (CFs) generated by BIRCH, dete riorate rapidly for higher compression rates. This is due to three key prob lems, which we identify. To solve these problems, we propose an efficient p ost-processing step and the concept of a Data Bubble as a special kind of c ompressed data item. Applying OPTICS to these Data Bubbles allows us to rec over a very accurate approximation of the clustering structure of a large d ata set even for very high compression rates. A comprehensive performance a nd quality evaluation shows that we only trade very little quality of the c lustering result for a great increase in performance.