Learning page-independent heuristics for extracting data from Web pages

Authors
Citation
W.W. Cohen and W. Fan, Learning page-independent heuristics for extracting data from Web pages, COMPUT NET, 31(11-16), 1999, pp. 1641-1652
Number of citations
19
Subject Categories
Information Technology & Communication Systems
Journal title
COMPUTER NETWORKS-THE INTERNATIONAL JOURNAL OF COMPUTER AND TELECOMMUNICATIONS NETWORKING
Journal ISSN
13891286
Volume
31
Issue
11-16
Year of publication
1999
Pages
1641 - 1652
Database
ISI
SICI code
1389-1286(19990517)31:11-16<1641:LPHFED>2.0.ZU;2-2
Abstract
One bottleneck in implementing a system that intelligently queries the Web is developing 'wrappers' - programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers. (C) 1999 Published by Elsevier Science B.V. All rights reserved.