Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data

Citation
C. Bohm et al., Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data, SIG RECORD, 30(2), 2001, pp. 379-388
Citations number
30
Categorie Soggetti
Computer Science & Engineering
Journal title
SIGMOD RECORD
ISSN journal
01635808 → ACNP
Volume
30
Issue
2
Year of publication
2001
Pages
379 - 388
Database
ISI
SICI code
0163-5808(200106)30:2<379:EGOAAF>2.0.ZU;2-0
Abstract
The similarity join is an important database primitive which has been succe ssfully applied to speed up applications such as similarity search, data an alysis and data mining. The similarity join combines two point sets of a mu ltidimensional vector space such that the result contains all point pairs w here the distance does not exceed a parameter E. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with c ell length E over the data space and comparing the grid cells lexicographic ally. A typical problem of grid-based approaches such as MSJ or the epsilon -kdB-tree is that large portions of the data sets must be held simultaneou sly in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm an d a particular scheduling strategy during the join phase. In the experiment al evaluation, a substantial improvement over competitive techniques is sho wn.