ITA
ENG

XTRACT: A system for extracting document type descriptors from XML documents

Authors

Garofalakis, M Gionis, A Rastogi, R Seshadri, S Shim, K

Citation

M. Garofalakis et al., XTRACT: A system for extracting document type descriptors from XML documents, SIG RECORD, 29(2), 2000, pp. 165-176

Citations number

Categorie Soggetti

Computer Science & Engineering

Journal title

SIGMOD RECORD

ISSN journal

01635808 → ACNP

Volume

Issue

Year of publication

2000

Pages

165 - 176

Database

ISI

SICI code

0163-5808(200006)29:2<165:XASFED>2.0.ZU;2-P

Abstract

XML is rapidly emerging as the new standard for data representation and exc hange on the Web. An XML document can be accompanied by a Document Type Des criptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus ha ve a crucial role in the efficient storage of XML data, as well as the effe ctive formulation and optimization of XML, queries. In this paper, we propo se XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of r egular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequ ences and replacing them with regular expressions to generate "general" can didate DTDs, (2) factoring candidate DTDs using adaptations of algorithms f rom the logic optimization literature, and (3) applying the Minimum Descrip tion Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate th e effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.