Generating finite-state transducers for semi-structured data extraction from the Web

Authors
Cn. Hsu; Mt. Dung
Citation
Cn. Hsu and Mt. Dung, Generating finite-state transducers for semi-structured data extraction from the Web, Information Systems, 23(8), 1998, pp. 521-538
Number of citations
20
Subject Categories
Information Technology & Communication Systems
Journal title
INFORMATION SYSTEMS
Journal ISSN
0306-4379
Volume
23
Issue
8
Year of publication
1998
Pages
521 - 538
Database
ISI
SICI code
0306-4379(199812)23:8<521:GFTFSD>2.0.ZU;2-R
Abstract
Integrating a large number of Web information sources may significantly increase the utility of the World-Wide Web. A promising approach to this integration is a Web information mediator that provides seamless, transparent access for clients. Information mediators need wrappers to access a Web source as a structured database, but building wrappers by hand is impractical. Previous work on wrapper induction is too restrictive to handle the large number of Web pages that contain tuples with missing attributes, multiple values, variant attribute permutations, exceptions and typos. This paper presents SoftMealy, a novel wrapper representation formalism. The representation is based on a finite-state transducer (FST) and contextual rules. The approach can wrap a wide range of semistructured Web pages because FSTs can encode each different attribute permutation as a path. A SoftMealy wrapper can be induced from a handful of labeled examples using our generalization algorithm. We have implemented this approach in a prototype system and tested it on real Web pages. The performance statistics show that the sizes of the induced wrappers, as well as the required training effort, are linear with regard to the structural variance of the test pages. Our experiments also show that the induced wrappers can generalize over unseen pages. © 1998 Elsevier Science Ltd. All rights reserved.
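The abstract's claim that an FST can encode each attribute permutation as a path can be illustrated with a toy sketch. This is not the SoftMealy algorithm: the transition table is written by hand rather than induced from labeled examples, the contextual rules are reduced to exact matches on delimiter tokens, and the attribute names (`name`, `title`, `email`), the delimiter tokens, and the function `extract` are all hypothetical.

```python
from collections import defaultdict

# Hand-written transition table (an assumption for illustration only):
# (current state, delimiter token) -> next state.
# Each state stands for one attribute; each path through the table
# corresponds to one attribute ordering that the wrapper accepts.
TRANSITIONS = {
    ("start", "<b>"): "name",
    ("name", "<i>"): "title",
    ("name", "<a>"): "email",   # path for tuples with a missing title
    ("title", "<a>"): "email",
    ("email", "<i>"): "title",  # path for the permutation email-then-title
}


def extract(tokens):
    """Walk the transducer over a token list and emit one record.

    A token that matches a transition from the current state moves the
    transducer to the next state; any other token is emitted as part of
    the attribute named by the current state.
    """
    state = "start"
    record = defaultdict(list)
    for tok in tokens:
        nxt = TRANSITIONS.get((state, tok))
        if nxt is not None:
            state = nxt
        elif state != "start":
            record[state].append(tok)
    return {attr: " ".join(words) for attr, words in record.items()}


full = extract(["<b>", "Ann", "Lee", "<i>", "Professor", "<a>", "ann@example.org"])
missing = extract(["<b>", "Bo", "<a>", "bo@example.org"])
permuted = extract(["<b>", "Cy", "<a>", "cy@example.org", "<i>", "Lecturer"])
```

The three calls follow three different paths through the same table: the full tuple, a tuple with a missing attribute, and a tuple whose attributes appear in a different order.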