Automatic segmentation of text into structured records

Citation
V. Borkar et al., Automatic segmentation of text into structured records, SIGMOD RECORD, 30(2), 2001, pp. 175-186
Number of citations
28
Subject categories
Computer Science & Engineering
Journal title
SIGMOD RECORD
ISSN journal
0163-5808
Volume
30
Issue
2
Year of publication
2001
Pages
175 - 186
Database
ISI
SICI code
0163-5808(200106)30:2<175:ASOTIS>2.0.ZU;2-I
Abstract
In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool DATAMOLD that learns to automatically extract structure when seeded with a small number of training examples. The tool enhances on Hidden Markov Models (HMM) to build a powerful probabilistic model that corroborates multiple sources of information including the sequence of elements, their length distribution, distinguishing words from the vocabulary and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
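
To make the HMM idea in the abstract concrete, here is a minimal sketch of segmenting an address with Viterbi decoding over a plain Hidden Markov Model. It is not DATAMOLD itself: the element labels, every probability, and the helper names (`normalize`, `emit`, `viterbi`, `segment`) are hypothetical values invented for illustration, whereas a real system would estimate the parameters from training examples. The sketch uses only two of the information sources the abstract lists (element order and distinguishing words) and leaves out length distributions and the external dictionary.

```python
import math

# Hypothetical element labels and toy parameters, invented for illustration.
STATES = ["HouseNo", "Street", "City"]

# Probability that a record starts with each element.
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}

# Probability of moving from one element to the next (encodes element order).
TRANS = {
    "HouseNo": {"HouseNo": 0.10, "Street": 0.80, "City": 0.10},
    "Street":  {"HouseNo": 0.05, "Street": 0.55, "City": 0.40},
    "City":    {"HouseNo": 0.05, "Street": 0.05, "City": 0.90},
}

# Distinguishing words per element; "<num>" stands for any all-digit token.
VOCAB = {
    "HouseNo": {"<num>": 0.9},
    "Street":  {"main": 0.3, "street": 0.4, "road": 0.2},
    "City":    {"springfield": 0.5, "mumbai": 0.4},
}
UNSEEN = 1e-3  # small floor so unknown words never get zero probability


def normalize(token):
    """Map digit-only tokens to a shared symbol and lowercase the rest."""
    return "<num>" if token.isdigit() else token.lower()


def emit(state, token):
    """Probability that `state` produces `token` under the toy vocabulary."""
    return VOCAB[state].get(normalize(token), UNSEEN)


def viterbi(tokens):
    """Return the most probable label sequence for `tokens` (log-space Viterbi)."""
    if not tokens:
        return []
    score = [{s: math.log(START[s]) + math.log(emit(s, tokens[0])) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for tok in tokens[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(emit(s, tok))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Follow back-pointers from the best final state to recover the path.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def segment(text):
    """Split `text` on whitespace and group its tokens by decoded label."""
    tokens = text.split()
    record = {}
    for tok, label in zip(tokens, viterbi(tokens)):
        record.setdefault(label, []).append(tok)
    return {label: " ".join(toks) for label, toks in record.items()}


print(segment("42 Main Street Springfield"))
```

With these toy parameters the transition table pulls "Main" and "Street" into the same element even though only the vocabulary entries distinguish them from a city name, which is the kind of corroboration between information sources that the abstract describes.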