XTRACT: A system for extracting document type descriptors from XML documents

Citation
M. Garofalakis et al., XTRACT: A system for extracting document type descriptors from XML documents, SIGMOD RECORD, 29(2), 2000, pp. 165-176
Number of citations
25
Subject categories
Computer Science & Engineering
Journal title
SIGMOD RECORD
Journal ISSN
0163-5808
Volume
29
Issue
2
Year of publication
2000
Pages
165 - 176
Database
ISI
SICI code
0163-5808(200006)29:2<165:XASFED>2.0.ZU;2-P
Abstract
XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate "general" candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.
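
Illustration of step (3)
The abstract does not give XTRACT's actual encoding scheme, so the following is only a minimal sketch of two-part MDL selection under assumed costs. Each character stands for one child element name, a candidate's own cost is its pattern length times an assumed per-symbol cost, and the cost of each input sequence given a candidate is a hand-assigned, hypothetical bit count. The names Candidate, model_bits, mdl_cost, COUNT_BITS and the sample sequences are all illustrative and do not come from the paper.

```python
import math
import re
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a toy alphabet of 4 element names plus 4 metacharacters (| * ( )),
# so every symbol of a pattern costs log2(8) = 3 bits.
BITS_PER_SYMBOL = math.log2(4 + 4)
# Assumption: a repetition count is always stored in a fixed 4 bits.
COUNT_BITS = 4


@dataclass
class Candidate:
    pattern: str                        # candidate content model as a regex
    data_bits: Callable[[str], float]   # assumed bits to encode one sequence


def model_bits(candidate: Candidate) -> float:
    """Cost of describing the candidate itself."""
    return len(candidate.pattern) * BITS_PER_SYMBOL


def mdl_cost(candidate: Candidate, sequences: List[str]) -> float:
    """Two-part cost: model bits plus bits for the data given the model."""
    for seq in sequences:
        if re.fullmatch(candidate.pattern, seq) is None:
            return math.inf  # a candidate must cover every input sequence
    return model_bits(candidate) + sum(candidate.data_bits(s) for s in sequences)


# Hypothetical input: child-element sequences seen under one parent element.
sequences = ["ab", "abab", "ababab", "abababab"]

candidates = [
    # Exact enumeration: long pattern, but only 2 bits to pick 1 of 4 branches.
    Candidate("ab|abab|ababab|abababab", lambda s: math.log2(4)),
    # Compact and specific: only the repetition count must be encoded.
    Candidate("(ab)*", lambda s: COUNT_BITS),
    # Overly general: the count plus one bit per symbol to say which branch.
    Candidate("(a|b)*", lambda s: COUNT_BITS + len(s)),
]

best = min(candidates, key=lambda c: mdl_cost(c, sequences))
print(best.pattern)
```

Under these assumed costs the enumeration pays for a long pattern, the general pattern pays for every symbol it has to spell out, and the intermediate pattern is cheapest overall. This is the kind of trade-off between conciseness and precision that an MDL criterion is meant to capture.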