One bottleneck in implementing a system that intelligently queries the Web
is developing 'wrappers': programs that extract data from Web pages. Here
we describe a method for learning general, page-independent heuristics for
extracting data from HTML documents. The input to our learning system is a
set of working wrapper programs, paired with HTML pages they correctly wrap.
The output is a general procedure for extracting data that works for many
formats and many pages. In experiments with a collection of 84 constrained
but realistic extraction problems, we demonstrate that 30% of the problems
can be handled perfectly by learned extraction heuristics, and around 50%
can be handled acceptably. We also demonstrate that learned page-independent
extraction heuristics can substantially improve the performance of methods
for learning page-specific wrappers. (C) 1999 Published by Elsevier Science
B.V. All rights reserved.