Data reconciliation is the process of matching records across different databases. It requires "joining" on fields that have traditionally been non-key fields. The operational databases are generally of sufficient quality for the purposes for which they were originally designed, but because the data in the different databases lack a canonical structure and may contain errors, approximate matching algorithms are required.
Approximate matching algorithms can have many different parameter settings. The number of parameters affects the complexity of the algorithm, because it determines the number of comparisons needed to identify matching records across different datasets. For the large datasets prevalent in data warehouses, this increased complexity may make a solution impractical.
In this paper, we describe an efficient method for data reconciliation. Our main contribution is the incorporation of machine learning and statistical techniques that reduce the complexity of the matching algorithms by identifying and eliminating redundant or useless parameters. We have conducted experiments on actual data that demonstrate the validity of our techniques; in these experiments, the techniques reduced complexity by 50% while significantly increasing matching accuracy. (C) 2001 Telcordia Technologies Inc. Published by Elsevier Science Inc. All rights reserved.