Efficient data reconciliation

Citation
M. Cochinwala et al., Efficient data reconciliation, INF SCI, 137(1-4), 2001, pp. 1-15
Number of citations
17
Subject categories
Information Technology & Communication Systems
Journal title
INFORMATION SCIENCES
Journal ISSN
0020-0255
Volume
137
Issue
1-4
Year of publication
2001
Pages
1 - 15
Database
ISI
SICI code
0020-0255(200109)137:1-4<1:EDR>2.0.ZU;2-8
Abstract
Data reconciliation is the process of matching records across different databases. Data reconciliation requires "joining" on fields that have traditionally been non-key fields. Generally, the operational databases are of sufficient quality for the purposes for which they were initially designed but since the data in the different databases do not have a canonical structure and may have errors, approximate matching algorithms are required. Approximate matching algorithms can have many different parameter settings. The number of parameters will affect the complexity of the algorithm due to the number of comparisons needed to identify matching records across different datasets. For large datasets that are prevalent in data warehouses, the increased complexity may result in impractical solutions. In this paper, we describe an efficient method for data reconciliation. Our main contribution is the incorporation of machine learning and statistical techniques to reduce the complexity of the matching algorithms via identification and elimination of redundant or useless parameters. We have conducted experiments on actual data that demonstrate the validity of our techniques. In our experiments, the techniques reduced complexity by 50% while significantly increasing matching accuracy. (C) 2001 Telcordia Technologies Inc. Published by Elsevier Science Inc. All rights reserved.
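
Illustrative sketch (not taken from the paper): the abstract notes that the cost of approximate matching grows with the number of compared parameters. The Python fragment below is a minimal illustration of that idea only, under stated assumptions. The records, the field names, the 0.85 threshold, and the helpers `similarity`, `match_records`, and `comparison_count` are all hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever matching functions the authors actually used. The machine learning and statistical selection of parameters described in the abstract is not reproduced here; the reduced field list is simply chosen by hand.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1], ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_records(left, right, fields, threshold=0.85):
    """Return (i, j, score) for record pairs whose mean field similarity
    over the given non-key fields reaches the threshold (assumed value)."""
    matches = []
    for i, l_rec in enumerate(left):
        for j, r_rec in enumerate(right):
            score = sum(similarity(l_rec[f], r_rec[f]) for f in fields) / len(fields)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


def comparison_count(left, right, fields):
    """Number of field-level comparisons performed by match_records."""
    return len(left) * len(right) * len(fields)


# Hypothetical records from two databases that share no common key.
customers = [
    {"name": "John Smith", "street": "12 Main Street", "city": "Newark"},
    {"name": "Mary Jones", "street": "4 Oak Avenue", "city": "Trenton"},
]
billing = [
    {"name": "Jon Smith", "street": "12 Main St", "city": "Newark"},
    {"name": "Peter Brown", "street": "9 Elm Road", "city": "Camden"},
]

all_fields = ["name", "street", "city"]
reduced_fields = ["name", "street"]  # hand-picked here, not learned

full = match_records(customers, billing, all_fields)
reduced = match_records(customers, billing, reduced_fields)

print([(i, j) for i, j, _ in full], comparison_count(customers, billing, all_fields))
print([(i, j) for i, j, _ in reduced], comparison_count(customers, billing, reduced_fields))
```

In this toy data both field sets pair the same two records, while the reduced set performs fewer field comparisons. Whether a given field can be dropped without losing accuracy is exactly the question the paper addresses with learned and statistical criteria, which this sketch does not attempt.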