Wrapping Web data into XML

Citation
W. Han et al., Wrapping Web data into XML, SIGMOD RECORD, 30(3), 2001, pp. 33-38
Number of citations
8
Subject Categories
Computer Science & Engineering
Journal title
SIGMOD RECORD
Journal ISSN
0163-5808
Volume
30
Issue
3
Year of publication
2001
Pages
33 - 38
Database
ISI
SICI code
0163-5808(200109)30:3<33:WWDIX>2.0.ZU;2-7
Abstract
The vast majority of online information is part of the World Wide Web. In order to use this information for more than human browsing, web pages in HTML must be converted into a format meaningful to software programs. Wrappers have been a useful technique to convert HTML documents into semantically meaningful XML files. However, developing wrappers is slow and labor-intensive. Further, frequent changes to the HTML documents typically require frequent changes in the wrappers. This paper describes XWRAP Elite, a tool to automatically generate robust wrappers. XWRAP breaks down the conversion process into three steps. First, discover where the data is located in an HTML page and separate the data into individual objects. Second, decompose objects into data elements. Third, mark objects and elements in an output format. XWRAP Elite automates the first two steps and minimizes human involvement in marking output data. Our experience shows that XWRAP is able to create useful wrapper software for a wide variety of real-world HTML documents.
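
The three steps named in the abstract can be illustrated with a minimal hand-written wrapper. This is only a sketch under stated assumptions and not XWRAP Elite's method: the tool discovers data regions and object boundaries automatically, whereas the code below hard-codes the rule that each table row is an object and each cell is an element. The names RowCollector, wrap, SAMPLE, and the tags book, title, and price are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class RowCollector(HTMLParser):
    """Steps 1 and 2 (hard-coded here, automated in the paper's tool):
    treat each <tr> as an object and each <td> inside it as an element."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def wrap(html, object_tag, field_tags):
    """Step 3: mark objects and elements with user-supplied XML names."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    root = ET.Element(object_tag + "s")
    for cells in collector.rows:
        if not cells:
            continue  # rows without <td> cells (e.g. header rows) hold no data
        obj = ET.SubElement(root, object_tag)
        for name, value in zip(field_tags, cells):
            ET.SubElement(obj, name).text = value.strip()
    return ET.tostring(root, encoding="unicode")


# Hypothetical input page, invented for this example.
SAMPLE = """<html><body><h1>Books</h1><table>
<tr><th>Title</th><th>Price</th></tr>
<tr><td>XML Basics</td><td> 20 </td></tr>
<tr><td>Web Data</td><td>35</td></tr>
</table></body></html>"""

xml_text = wrap(SAMPLE, "book", ["title", "price"])
print(xml_text)
```

In this sketch the only human input is the list of output names passed to wrap, which loosely mirrors the abstract's point that marking the output data is the step where human involvement remains. A wrapper written this way breaks as soon as the page layout changes, which is the maintenance problem the abstract describes.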