ITA
ENG

Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data

Authors

Bohm, C Braunmuller, B Krebs, F Kriege, HP

Citation

C. Bohm et al., Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data, SIG RECORD, 30(2), 2001, pp. 379-388

Citations number

Categorie Soggetti

Computer Science & Engineering

Journal title

SIGMOD RECORD

ISSN journal

01635808 → ACNP

Volume

Issue

Year of publication

2001

Pages

379 - 388

Database

ISI

SICI code

0163-5808(200106)30:2<379:EGOAAF>2.0.ZU;2-0

Abstract

The similarity join is an important database primitive which has been succe ssfully applied to speed up applications such as similarity search, data an alysis and data mining. The similarity join combines two point sets of a mu ltidimensional vector space such that the result contains all point pairs w here the distance does not exceed a parameter E. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with c ell length E over the data space and comparing the grid cells lexicographic ally. A typical problem of grid-based approaches such as MSJ or the epsilon -kdB-tree is that large portions of the data sets must be held simultaneou sly in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm an d a particular scheduling strategy during the join phase. In the experiment al evaluation, a substantial improvement over competitive techniques is sho wn.