Data reconciliation is the process of matching records across different databases. It requires "joining" on fields that have traditionally been non-key fields. The operational databases are generally of sufficient quality for the purposes for which they were originally designed, but because the data in the different databases lack a canonical structure and may contain errors, approximate matching algorithms are required.
Approximate matching algorithms can have many different parameter settings. The number of parameters affects the complexity of the algorithm, because it determines the number of comparisons needed to identify matching records across different datasets. For the large datasets prevalent in data warehouses, this increased complexity may make a solution impractical.
In this paper, we describe an efficient method for data reconciliation. Our main contribution is the incorporation of machine learning and statistical techniques that reduce the complexity of the matching algorithms by identifying and eliminating redundant or useless parameters. We have conducted experiments on actual data that demonstrate the validity of our techniques; in these experiments, the techniques reduced complexity by 50% while significantly increasing matching accuracy. (C) 2001 Telcordia Technologies Inc. Published by Elsevier Science Inc. All rights reserved.