Hierarchical wrapper induction for semistructured information sources

Citation
I. Muslea et al., Hierarchical wrapper induction for semistructured information sources, AUTON-AGENT, 4(1-2), 2001, pp. 93-114
Number of citations
15
Subject Categories
AI Robotics and Automatic Control
Journal title
AUTONOMOUS AGENTS AND MULTI-AGENT SYSTEMS
Journal ISSN
1387-2532
Volume
4
Issue
1-2
Year of publication
2001
Pages
93 - 114
Database
ISI
SICI code
1387-2532(200103/06)4:1-2<93:HWIFSI>2.0.ZU;2-Z
Abstract
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
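
The idea of hierarchical extraction mentioned in the abstract can be illustrated with a small sketch. This is not the STALKER algorithm: the rules below are written by hand rather than induced from labeled examples, and the sample page, the landmark strings, and every function name are assumptions made up for illustration. It only shows how one extraction from a nested document can be split into simpler steps, first isolating a parent region and then extracting the items inside it.

```python
from typing import List, Optional


def skip_to(text: str, landmark: str, start: int = 0) -> Optional[int]:
    """Return the index just past the first occurrence of landmark, or None."""
    pos = text.find(landmark, start)
    return None if pos < 0 else pos + len(landmark)


def extract_between(text: str, start_landmark: str, end_landmark: str) -> Optional[str]:
    """Extract the first span enclosed by a start and an end landmark."""
    begin = skip_to(text, start_landmark)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    return None if end < 0 else text[begin:end]


def extract_list(text: str, start_landmark: str, end_landmark: str) -> List[str]:
    """Extract every span enclosed by the landmarks, scanning left to right."""
    items: List[str] = []
    pos = 0
    while True:
        begin = skip_to(text, start_landmark, pos)
        if begin is None:
            break
        end = text.find(end_landmark, begin)
        if end < 0:
            break
        items.append(text[begin:end])
        pos = end + len(end_landmark)
    return items


def extract_restaurant(page: str) -> dict:
    """Hand-written, hypothetical rules applied level by level."""
    # Level 1: fields of the whole page.
    name = extract_between(page, "<b>", "</b>")
    address_block = extract_between(page, "<ul>", "</ul>") or ""
    # Level 2: items inside the region isolated at level 1.
    addresses = extract_list(address_block, "<li>", "</li>")
    return {"name": name, "addresses": addresses}


# Hypothetical sample page, invented for this sketch.
PAGE = (
    "<p>Name: <b>Yala</b></p>"
    "<ul><li>4000 Colfax, Phoenix</li><li>523 Vernon, Las Vegas</li></ul>"
)

if __name__ == "__main__":
    print(extract_restaurant(PAGE))
```

Each rule here only has to be correct within the region handed to it by its parent, which is the sense in which a complex extraction becomes a series of simpler ones.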