Learning page-independent heuristics for extracting data from Web pages

Authors

Cohen, WW Fan, W

Citation

W.W. Cohen and W. Fan, Learning page-independent heuristics for extracting data from Web pages, COMPUT NET, 31(11-16), 1999, pp. 1641-1652

Citations number

Subject categories

Information Technology & Communication Systems

Journal title

COMPUTER NETWORKS-THE INTERNATIONAL JOURNAL OF COMPUTER AND TELECOMMUNICATIONS NETWORKING

ISSN journal

1389-1286

Volume

31

Issue

11-16

Year of publication

1999

Pages

1641 - 1652

Database

ISI

SICI code

1389-1286(19990517)31:11-16<1641:LPHFED>2.0.ZU;2-2

Abstract

One bottleneck in implementing a system that intelligently queries the Web is developing 'wrappers' - programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers. (C) 1999 Published by Elsevier Science B.V. All rights reserved.