Existing duplicate elimination methods for data cleaning work by computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision; high precision can be achieved analogously, at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategy, and more. We propose a new method for computing transitive closure under uncertainty to handle the merging of groups of inexact duplicate records, and we explain why small changes to window sizes have little effect on the results of the sorted neighborhood method. Experiments with two real-world datasets show that this approach can accurately identify duplicates and anomalies with high recall and precision, thus effectively resolving the recall-precision dilemma.
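To make the recall-precision dilemma concrete, the following is a minimal sketch of threshold-based matching within a sorted neighborhood window. The function names (sorted_neighborhood, similarity), the use of difflib's ratio as a stand-in similarity measure, and the parameter defaults are illustrative assumptions, not the paper's implementation; the paper's framework applies knowledge-based rules rather than a single string metric.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1]; the paper's actual
    similarity computation is knowledge-based and domain-specific."""
    return SequenceMatcher(None, a, b).ratio()

def sorted_neighborhood(records, key, window=5, threshold=0.8):
    """Sort records on a key, slide a fixed-size window over the sorted
    list, and flag pairs inside the window whose similarity meets the
    threshold. Raising the threshold trades recall for precision, which
    is the recall-precision dilemma the abstract describes."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similarity(key(rec), key(ordered[j])) >= threshold:
                pairs.append((rec, ordered[j]))
    return pairs

# Duplicates that sort near each other are caught; enlarging the window
# only matters when true duplicates land far apart after sorting, which
# is consistent with the abstract's claim that small window-size changes
# have little effect.
names = ["John Smith", "Jon Smith", "Jane Doe", "J. Smith", "Jane Do"]
print(sorted_neighborhood(names, key=lambda r: r, window=3, threshold=0.7))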
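The merging of groups of inexact duplicates can be pictured as a transitive closure over pairwise match decisions, conventionally computed with a union-find structure. The baseline sketch below treats every accepted match as certain; the paper's contribution is precisely a closure computed under uncertainty, so this code shows only the naive behavior being improved on, and all names here are assumed for illustration.

class UnionFind:
    """Standard disjoint-set structure used to take the transitive
    closure of pairwise 'is a duplicate of' decisions."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_groups(matched_pairs):
    """Collapse matched pairs into duplicate groups. The pitfall the
    paper addresses: if A~B and B~C were each accepted with low
    confidence, plain closure still forces A, B, and C into one group,
    hence the need for transitive closure under uncertainty."""
    uf = UnionFind()
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = {}
    for x in list(uf.parent):
        groups.setdefault(uf.find(x), []).append(x)
    return list(groups.values())

print(merge_groups([("John Smith", "Jon Smith"), ("Jon Smith", "J. Smith")]))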