Data bubbles: Quality preserving performance boosting for hierarchical clustering

Citation
Mm. Breunig et al., Data bubbles: Quality preserving performance boosting for hierarchical clustering, SIG RECORD, 30(2), 2001, pp. 79-90
Citations number
10
Categorie Soggetti
Computer Science & Engineering
Journal title
SIGMOD RECORD
ISSN journal
01635808 → ACNP
Volume
30
Issue
2
Year of publication
2001
Pages
79 - 90
Database
ISI
SICI code
0163-5808(200106)30:2<79:DBQPPB>2.0.ZU;2-R
Abstract
In this paper, we investigate how to scale hierarchical clustering methods (such as OPTICS) to extremely large databases by utilizing data compression methods (such as BIRCH or random sampling). We propose a three step proced ure: 1) compress the data into suitable representative objects; 2) apply th e hierarchical clustering algorithm only to these objects; 3) recover the c lustering structure for the whole data set, based on the result for the com pressed data. The key issue in this approach is to design compressed data i tems such that not only a hierarchical clustering algorithm can be applied, but also that they contain enough information to infer the clustering stru cture of the original data Set in the third step. This is crucial because t he results of hierarchical clustering algorithms, when applied naively to a random sample or to the clustering features (CFs) generated by BIRCH, dete riorate rapidly for higher compression rates. This is due to three key prob lems, which we identify. To solve these problems, we propose an efficient p ost-processing step and the concept of a Data Bubble as a special kind of c ompressed data item. Applying OPTICS to these Data Bubbles allows us to rec over a very accurate approximation of the clustering structure of a large d ata set even for very high compression rates. A comprehensive performance a nd quality evaluation shows that we only trade very little quality of the c lustering result for a great increase in performance.