Tree patterns form a natural basis to query tree-structured data such as XM
L and LDAP. Since the efficiency of tree pattern matching against a tree-st
ructured database depends on the size of the pattern, it is essential to id
entify and eliminate redundant nodes in the pattern and do so as quickly as
possible. In this paper, we study tree pattern minimization both in the ab
sence and in the presence of integrity constraints (ICs) on the underlying
tree-structured database.
When no ICs are considered, we call the process of minimizing a tree patter
n, constraint-independent minimization. We develop a polynomial time algori
thm called CIM for this purpose. CIM's efficiency stems from two key proper
ties: (i) a node cannot be redundant unless its children are, and (ii) the
order of elimination of redundant nodes is immaterial. When ICs are conside
red for minimization, we refer to it as constraint-dependent minimization.
For tree-structured databases, required child/descendant and type co-occurr
ence ICs are very natural. Under such ICs, we show that the minimal equival
ent query is unique. We show the surprising result that the algorithm obtai
ned by first augmenting the tree pattern using ICs, and then applying CIM,
always finds the unique minimal equivalent query; we refer to this algorith
m as ACIM. While ACIM is also polynomial time, it can be expensive in pract
ice because of its inherent non-locality. We then present a fast algorithm,
CDM, that identifies and eliminates local redundancies due to ICs, based o
n propagating "information labels" up the tree pattern. CDM can be applied
prior to ACIM for improving the minimization efficiency. We complement our
analytical results with an experimental study that shows the effectiveness
of our tree pattern minimization techniques.