This paper describes a knowledge-based system for classifying documents
based upon the layout structure and conceptual information extracted
from their contents. The spatial elements in a document are laid out
in rectangular blocks, which are represented by nodes in an ordered
labeled tree called the "Layout Structure Tree" (L-S Tree). Each leaf
node of an L-S Tree points to its corresponding block content. A
Knowledge Acquisition Tool (KAT) is devised to perform inductive
learning from the L-S Trees of document samples and then generate the Document
Sample Tree and Document Type Tree bases. A testing document is
classified if a Document Type Tree is discovered as a substructure of the
L-S Tree of the testing document. Then we match the L-S Tree with the
Document Sample Trees of the classified document type to find the format
of the testing document. The Document Sample Trees and Document Type
Trees together form the Structural Knowledge Base (SKB). The tree discovery
and matching processes involve comparing the SKB trees and a testing
document's L-S Tree by using pattern matching and discovery toolkits.
Our experimental results demonstrate that many office documents can
be classified correctly using the proposed approach. (C) Elsevier
Science Inc. 1997.
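
The classification step described above can be illustrated with a minimal sketch. It is not the system's implementation, and the abstract does not define "substructure" precisely. The sketch assumes that a Document Type Tree is a substructure of an L-S Tree when its root matches some node by label and its children match an order-preserving subsequence of that node's children, recursively. The names `Node`, `embeds_at`, `is_substructure`, and `classify`, as well as the block labels, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One rectangular block in an ordered labeled tree (hypothetical layout)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    content: Optional[str] = None  # only leaf nodes point to block content


def embeds_at(pattern: Node, target: Node) -> bool:
    """Assumed rule: labels are equal, and the pattern's children match an
    order-preserving subsequence of the target's children, recursively."""
    if pattern.label != target.label:
        return False
    i = 0
    for p_child in pattern.children:
        # Scan left to right for the next target child that this pattern child fits.
        while i < len(target.children) and not embeds_at(p_child, target.children[i]):
            i += 1
        if i == len(target.children):
            return False
        i += 1
    return True


def is_substructure(pattern: Node, target: Node) -> bool:
    """True if the pattern embeds at the target node or at any descendant."""
    return embeds_at(pattern, target) or any(
        is_substructure(pattern, child) for child in target.children
    )


def classify(ls_tree: Node, type_trees: Dict[str, Node]) -> Optional[str]:
    """Return the first document type whose tree is a substructure, else None."""
    for doc_type, type_tree in type_trees.items():
        if is_substructure(type_tree, ls_tree):
            return doc_type
    return None


# Hypothetical L-S Tree of a testing document (labels are illustrative only).
memo = Node("page", [
    Node("header", [Node("to", content="To: Staff"),
                    Node("from", content="From: Dean")]),
    Node("body", content="Meeting at noon."),
])

# Hypothetical Document Type Tree base.
type_trees = {
    "letter": Node("page", [Node("salutation"), Node("body"), Node("signature")]),
    "memo": Node("page", [Node("header", [Node("to"), Node("from")]), Node("body")]),
}

print(classify(memo, type_trees))  # prints "memo"
```

Under this assumption, matching the L-S Tree against the Document Sample Trees of the classified type could reuse the same routine with a stricter comparison, for example one that requires every child to match rather than a subsequence.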