With the tremendous amount of information that becomes available on the Web
on a daily basis, the ability to quickly develop information agents has
become a crucial problem. A vital component of any Web-based information
agent is a set of wrappers that can extract the relevant data from
semistructured information sources. Our novel approach to wrapper induction
is based on the idea of hierarchical information extraction, which turns the
hard problem of extracting data from an arbitrarily complex document into a
series of simpler extraction tasks. We introduce an inductive algorithm,
STALKER, that generates high accuracy extraction rules based on user-labeled
training examples. Labeling the training data represents the major
bottleneck in using wrapper induction techniques, and our experimental
results show that STALKER requires up to two orders of magnitude fewer
examples than other algorithms. Furthermore, STALKER can wrap information
sources that could not be wrapped by existing inductive techniques.