L. Bright et al., A wrapper generation toolkit to specify and construct wrappers for web accessible data sources (WebSources), COMP SYS SC, 14(2), 1999, pp. 83-97
The number of data sources that can be queried across the WWW is increasing. Such sources typically support HTML forms-based interfaces, and search engines query collections of suitably indexed data. The data is displayed via a browser. These sources have three drawbacks. First, there is no standard programming interface suitable for applications to submit queries. Second, the output (the answer to a query) is not well structured: structured objects have to be extracted from HTML documents that contain irrelevant data and may be volatile. Third, domain knowledge about the data source is also embedded in HTML documents and must be extracted. To solve these problems, we present technology to define and generate wrappers for Web accessible sources (WebSources). Our contributions are as follows: (1) Defining a wrapper interface to specify the capability of WebSources. (2) Developing a wrapper generation toolkit of graphical interfaces and specification languages to specify the capability of sources and the functionality of the wrapper. The toolkit provides a graphical interface to specify the capabilities of the sources and to define a simple query translation and answer extraction process. It supports a language to specify a URLConstructor expression for a given query. It supports a declarative Qualified-path-expression Extractor Language, QEL, to describe a simple Extractor that can extract data from a single HTML document. The toolkit also supports a Complex Extractor Specification Language, CESL, to specify extractors with more complex capability. (3) Developing the technology to generate a wrapper appropriate to the WebSource from the specifications.
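
The two wrapper steps named in the abstract, translating a query into a URL and extracting structured data from the returned HTML, can be illustrated with a minimal Python sketch. It is not the toolkit's URLConstructor language, QEL, or CESL, whose syntax is not given here. The names construct_url, PathExtractor and extract, the example.org address, and the slash-separated tag path are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def construct_url(base_url, bindings):
    """Hypothetical URL constructor: encode query bindings as form parameters."""
    return base_url + "?" + urlencode(bindings)


class PathExtractor(HTMLParser):
    """Hypothetical extractor: collect text found directly under one tag path.

    Simplification: unclosed tags such as <br> stay on the stack until an
    enclosing tag closes, so text following them is not matched.
    """

    def __init__(self, path):
        super().__init__()
        self.path = path.split("/")  # e.g. ["html", "body", "table", "tr", "td"]
        self.stack = []              # tags that are currently open
        self.values = []             # extracted text fragments

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.path:
            self.values.append(text)


def extract(html, path):
    """Return the text of every element reached by the given tag path."""
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.values
```

Under these assumptions, a caller would build a URL with construct_url("http://example.org/search", {"author": "Bright"}), fetch the page by any HTTP client, and pass the returned HTML to extract with a path such as "html/body/table/tr/td". Text outside that path, such as headings or advertisements, is ignored.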