A wrapper generation toolkit to specify and construct wrappers for web accessible data sources (WebSources)

Citation
L. Bright et al., A wrapper generation toolkit to specify and construct wrappers for web accessible data sources (WebSources), COMP SYS SC, 14(2), 1999, pp. 83-97
Number of citations
29
Subject categories
Computer Science & Engineering
Journal title
COMPUTER SYSTEMS SCIENCE AND ENGINEERING
ISSN journal
0267-6192
Volume
14
Issue
2
Year of publication
1999
Pages
83 - 97
Database
ISI
SICI code
0267-6192(199903)14:2<83:AWGTTS>2.0.ZU;2-0
Abstract
There is an increase in the number of data sources that can be queried across the WWW. Such sources typically support HTML forms-based interfaces and search engines query collections of suitably indexed data. The data is displayed via a browser. One drawback to these sources is that there is no standard programming interface suitable for applications to submit queries. Second, the output (answer to a query) is not well structured. Structured objects have to be extracted from the HTML documents which contain irrelevant data and which may be volatile. Third, domain knowledge about the data source is also embedded in HTML documents and must be extracted. To solve these problems, we present technology to define and generate wrappers for Web accessible sources (WebSources). Our contributions are as follows: (1) Defining a wrapper interface to specify the capability of WebSources. (2) Developing a wrapper generation toolkit of graphical interfaces and specification languages to specify the capability of sources and the functionality of the wrapper. The toolkit provides a graphical interface to specify the capabilities of the sources and to define a simple query translation and answer extraction process. It supports a language to specify a URLConstructor expression, for some query. It supports a declarative Qualified-path-expression Extractor Language, QEL, to describe a simple Extractor that can extract data from a single HTML document. The toolkit also supports a Complex Extractor Specification Language, CESL, to specify extractors with more complex capability. The third contribution is (3) Developing the technology to generate a wrapper appropriate to the WebSource, from the specifications.