In this paper, we investigate how to scale hierarchical clustering methods
(such as OPTICS) to extremely large databases by utilizing data compression
methods (such as BIRCH or random sampling). We propose a three step proced
ure: 1) compress the data into suitable representative objects; 2) apply th
e hierarchical clustering algorithm only to these objects; 3) recover the c
lustering structure for the whole data set, based on the result for the com
pressed data. The key issue in this approach is to design compressed data i
tems such that not only a hierarchical clustering algorithm can be applied,
but also that they contain enough information to infer the clustering stru
cture of the original data Set in the third step. This is crucial because t
he results of hierarchical clustering algorithms, when applied naively to a
random sample or to the clustering features (CFs) generated by BIRCH, dete
riorate rapidly for higher compression rates. This is due to three key prob
lems, which we identify. To solve these problems, we propose an efficient p
ost-processing step and the concept of a Data Bubble as a special kind of c
ompressed data item. Applying OPTICS to these Data Bubbles allows us to rec
over a very accurate approximation of the clustering structure of a large d
ata set even for very high compression rates. A comprehensive performance a
nd quality evaluation shows that we only trade very little quality of the c
lustering result for a great increase in performance.