XML is rapidly emerging as the new standard for data representation and exc
hange on the Web. An XML document can be accompanied by a Document Type Des
criptor (DTD) which plays the role of a schema for an XML data collection.
DTDs contain valuable information on the structure of documents and thus ha
ve a crucial role in the efficient storage of XML data, as well as the effe
ctive formulation and optimization of XML, queries. In this paper, we propo
se XTRACT, a novel system for inferring a DTD schema for a database of XML
documents. Since the DTD syntax incorporates the full expressive power of r
egular expressions, naive approaches typically fail to produce concise and
intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence
of sophisticated steps that involve: (1) finding patterns in the input sequ
ences and replacing them with regular expressions to generate "general" can
didate DTDs, (2) factoring candidate DTDs using adaptations of algorithms f
rom the logic optimization literature, and (3) applying the Minimum Descrip
tion Length (MDL) principle to find the best DTD among the candidates. The
results of our experiments with real-life and synthetic DTDs demonstrate th
e effectiveness of XTRACT's approach in inferring concise and semantically
meaningful DTD schemas for XML databases.