Schemas for web data: a reverse engineering approach

Authors

Bhowmick, SS; Ng, WK; Madria, S

Citation

S.S. Bhowmick et al., Schemas for web data: a reverse engineering approach, Data & Knowledge Engineering, 39(2), 2001, pp. 105-142

Subject Categories

AI Robotics and Automatic Control

Journal title

DATA & KNOWLEDGE ENGINEERING

ISSN journal

0169-023X

Volume

39

Issue

2

Year of publication

2001

Pages

105 - 142

Database

ISI

SICI code

0169-023X(200111)39:2<105:SFWDAR>2.0.ZU;2-B

Abstract

In this paper, we show how to generate schemas of a set of HTML or XML documents retrieved from the web in the context of our web warehousing system called WHOWEDA (WareHouse Of WEb DAta). Web schemas are used to bind a web table that contains a collection of interlinked web documents called web tuples. These schemas specify the metadata, content and structural properties (in the form of predicates) shared by the web documents and hyperlinks in the web table. They also summarize the hyperlink structure of these documents using the notion of connectivities. Web schemas are generated in three stages. In the first stage, a simple or complex web schema is generated from the user's query (coupling query). In the next stage, the complex web schema is decomposed into a set of simple web schemas. These two stages are performed without inspecting the data instances, i.e., web tuples. Finally, in the last stage, the set of simple web schemas is pruned by inspecting the hyperlink structure of the web tuples. We also discuss the formal algorithm for generating a set of simple web schemas from a coupling query. (C) 2001 Elsevier Science B.V. All rights reserved.
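
The three stages summarized in the abstract can be pictured with a toy sketch. This is not the algorithm from the paper: the data model is an assumption made purely for illustration. Here a connectivity is a hypothetical (source, link, target) triple of variable names, a complex schema is a list of clauses whose entries are alternative connectivities, a simple schema is a plain set of connectivities, and a web tuple is represented only by the set of connectivities it exhibits. All function names are invented for this example.

```python
from dataclasses import dataclass
from itertools import product

# Assumption: a connectivity is a (source, link, target) triple of variable names.
Connectivity = tuple[str, str, str]


@dataclass(frozen=True)
class SimpleSchema:
    """Hypothetical simple schema: a conjunction of connectivities."""
    connectivities: frozenset


def generate_schema(query_clauses):
    """Stage 1 (toy): derive a schema from the coupling query.

    Each clause lists alternative connectivities; a clause with more
    than one alternative makes the resulting schema 'complex'.
    """
    return [list(clause) for clause in query_clauses]


def decompose(complex_schema):
    """Stage 2 (toy): one simple schema per choice of alternative in each clause."""
    return [SimpleSchema(frozenset(choice)) for choice in product(*complex_schema)]


def prune(schemas, web_tuples):
    """Stage 3 (toy): keep schemas whose connectivities all occur in some web tuple."""
    return [s for s in schemas if any(s.connectivities <= t for t in web_tuples)]


# Illustrative query: x links to y, and y links either to z or to w.
query = [
    [("x", "e", "y")],
    [("y", "f", "z"), ("y", "g", "w")],
]

complex_schema = generate_schema(query)   # no data inspected
simple = decompose(complex_schema)        # no data inspected

# Made-up data: the only retrieved web tuple follows the x -> y -> z path.
web_tuples = [frozenset({("x", "e", "y"), ("y", "f", "z")})]
kept = prune(simple, web_tuples)          # inspects hyperlink structure

for schema in kept:
    print(sorted(schema.connectivities))
```

As in the abstract, the first two stages in this sketch never look at the web tuples; only the pruning step does, which is why it can discard the alternative that no retrieved tuple actually follows.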