Reconciling schemas of disparate data sources: A machine-learning approach

Citation
Ah. Doan et al., Reconciling schemas of disparate data sources: A machine-learning approach, SIGMOD RECORD, 30(2), 2001, pp. 509-520
Number of citations
26
Subject Categories
Computer Science & Engineering
Journal title
SIGMOD RECORD
ISSN journal
0163-5808
Volume
30
Issue
2
Year of publication
2001
Pages
509 - 520
Database
ISI
SICI code
0163-5808(200106)30:2<509:RSODDS>2.0.ZU;2-U
Abstract
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and develop a novel learner that utilizes the structural information in XML documents. Our approach thus is distinguished in that it incorporates multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy.
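
The multi-learner idea in the abstract can be illustrated with a small Python sketch. This is not LSD's implementation: the `TokenLearner` class, the `combine` function, the sample elements, the labels, and the fixed weights are all hypothetical and chosen only for illustration. In particular, the weights here are set by hand, whereas the abstract describes a trained meta-learner, and domain constraints and the XML learner are left out.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a string into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TokenLearner:
    """Hypothetical base learner: scores labels by token overlap with training data."""

    def __init__(self, view):
        self.view = view                     # function: element -> text this learner inspects
        self.counts = defaultdict(Counter)   # mediated-schema label -> token counts

    def fit(self, examples):
        # examples: (source element, mediated-schema label) pairs supplied by the user
        for element, label in examples:
            self.counts[label].update(tokenize(self.view(element)))
        return self

    def predict(self, element):
        toks = tokenize(self.view(element))
        # Add-one smoothing so every known label keeps a non-zero score.
        raw = {label: 1 + sum(c[t] for t in toks) for label, c in self.counts.items()}
        total = sum(raw.values())
        return {label: score / total for label, score in raw.items()}


def combine(predictions, weights):
    """Stand-in for a meta-learner: weighted sum of the base learners' scores."""
    combined = Counter()
    for scores, weight in zip(predictions, weights):
        for label, score in scores.items():
            combined[label] += weight * score
    best = max(combined, key=combined.get)
    return best, dict(combined)


# Manually mapped elements from a "training" source (made-up data).
training = [
    ({"name": "contact-phone", "values": ["(206) 555 0101"]}, "PHONE"),
    ({"name": "list-price", "values": ["$250,000"]}, "PRICE"),
    ({"name": "comments", "values": ["Great house near park"]}, "DESCRIPTION"),
]

# One learner looks at element names, the other at data values.
name_learner = TokenLearner(lambda e: e["name"]).fit(training)
value_learner = TokenLearner(lambda e: " ".join(e["values"])).fit(training)

# An element from a new, unmapped source.
new_element = {"name": "agent-phone", "values": ["(425) 555 0199"]}
predictions = [name_learner.predict(new_element), value_learner.predict(new_element)]
best, scores = combine(predictions, weights=[0.6, 0.4])  # weights are assumed, not learned
print(best, scores)
```

Each learner sees only one kind of evidence (names or values), so a further learner could be added by passing a different `view` function, which loosely mirrors the extensibility the abstract mentions.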