In this paper we present a method for automatically segmenting unformatted
text records into structured elements. Many useful data sources today are
human-generated as continuous text, whereas convenient use requires the
data to be organized as structured records. A prime motivation is the
warehouse address cleaning problem: transforming dirty addresses stored in
large corporate databases as a single text field into subfields such as
"City" and "Street". Existing tools rely on hand-tuned, domain-specific,
rule-based systems.
We describe a tool, DATAMOLD, that learns to automatically extract
structure when seeded with a small number of training examples. The tool
extends Hidden Markov Models (HMMs) to build a powerful probabilistic model
that combines multiple sources of information, including the sequence of
elements, their length distribution, distinguishing words from the
vocabulary, and an optional external data dictionary. Experiments on
real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US
addresses. In contrast, existing information extraction methods based on
rule-learning techniques yielded considerably lower accuracy.
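
To make the idea of HMM-based segmentation concrete, the sketch below labels
the tokens of an address with a plain first-order HMM decoded by the Viterbi
algorithm. It is an illustration only, not DATAMOLD's model: it covers just
the "sequence of elements" and "distinguishing words" signals and leaves out
the length distribution and the external dictionary. The state names, the
probabilities (hand-set here rather than learned from training examples),
the UNSEEN smoothing constant, and the function names are all assumptions
made for this example.

```python
import math

# Hypothetical element types (HMM states); not taken from the paper.
STATES = ["HouseNo", "Street", "City"]

# Hand-set toy parameters. A real system would estimate these from
# a small set of labeled training addresses.
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.55, "City": 0.4},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.9},
}
EMIT = {
    "HouseNo": {"<num>": 0.9},
    "Street": {"main": 0.3, "street": 0.4, "road": 0.2},
    "City": {"springfield": 0.5, "mumbai": 0.4},
}
UNSEEN = 1e-3  # assumed probability for symbols a state has never emitted


def symbol(token):
    """Map a raw token to an emission symbol: digits collapse to '<num>'."""
    return "<num>" if token.isdigit() else token.lower()


def viterbi(tokens):
    """Return [(token, state), ...] for the most probable state sequence."""
    if not tokens:
        return []
    obs = [symbol(t) for t in tokens]

    def emit(state, o):
        return math.log(EMIT[state].get(o, UNSEEN))

    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + emit(s, obs[0]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + emit(s, o)
            ptr[s] = best
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return list(zip(tokens, path))
```

With these toy parameters, viterbi("221 Main Street Springfield".split())
labels "221" as HouseNo, "Main" and "Street" as Street, and "Springfield" as
City. Working in log space avoids numeric underflow on longer records.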