KNOWLEDGE DISCOVERING FOR DOCUMENT CLASSIFICATION USING TREE MATCHING IN TEXPROS

Authors

WEI CS; LIU QH; WANG JTL; NG PA

Citation

Cs. Wei et al., KNOWLEDGE DISCOVERING FOR DOCUMENT CLASSIFICATION USING TREE MATCHING IN TEXPROS, Information sciences, 100(1-4), 1997, pp. 255-310

Subject Categories

Information Science & Library Science; Computer Science Information Systems

Journal title

Information sciences

ISSN journal

00200255

Volume

100

Issue

1-4

Year of publication

1997

Pages

255 - 310

Database

ISI

SICI code

0020-0255(1997)100:1-4<255:KDFDCU>2.0.ZU;2-8

Abstract

This paper describes a knowledge-based system for classifying documents based upon the layout structure and conceptual information extracted from their contents. The spatial elements in a document are laid out in rectangular blocks which are represented by nodes in an ordered labeled tree, called the "Layout Structure Tree" (L-S Tree). Each leaf node of an L-S Tree points to its corresponding block content. A Knowledge Acquisition Tool (KAT) is devised to perform the inductive learning from L-S Trees of document samples, and then generate the Document Sample Tree and Document Type Tree bases. A testing document is classified if a Document Type Tree is discovered as a substructure of the L-S Tree of the testing document. Then we match the L-S Tree with the Document Sample Trees of the classified document type to find the format of the testing document. The Document Sample Trees and Document Type Trees are called Structural Knowledge Base (SKB). The tree discovering and matching processes involve comparing the SKB trees and a testing document's L-S Tree by using pattern matching and discovering toolkits. Our experimental results demonstrate that many office documents can be classified correctly using the proposed approach. (C) Elsevier Science Inc. 1997.
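
The classification step in the abstract (a Document Type Tree discovered as a substructure of a testing document's L-S Tree) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the matching procedure, so this sketch assumes exact label matching with order-preserving embedding of children. The names `Node`, `matches_at`, `occurs_in`, `classify`, and all block labels and document types below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node of an ordered labeled tree (one layout block per node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def matches_at(pattern: Node, tree: Node) -> bool:
    """True if `pattern` embeds at `tree`: the root labels are equal and the
    pattern's children match, in order, a subsequence of the tree's children.
    ASSUMPTION: exact, order-preserving matching stands in for the paper's
    unspecified pattern matching toolkit."""
    if pattern.label != tree.label:
        return False
    i = 0
    for p_child in pattern.children:
        # Advance to the earliest remaining child that matches this pattern child.
        while i < len(tree.children) and not matches_at(p_child, tree.children[i]):
            i += 1
        if i == len(tree.children):
            return False
        i += 1
    return True


def occurs_in(pattern: Node, tree: Node) -> bool:
    """True if `pattern` embeds at the root of `tree` or at any descendant."""
    if matches_at(pattern, tree):
        return True
    return any(occurs_in(pattern, child) for child in tree.children)


def classify(ls_tree: Node, type_trees: Dict[str, Node]) -> Optional[str]:
    """Return the first document type whose tree occurs in `ls_tree`, else None."""
    for doc_type, type_tree in type_trees.items():
        if occurs_in(type_tree, ls_tree):
            return doc_type
    return None


# Hypothetical Document Type Trees; labels are illustrative only.
TYPE_TREES = {
    "memo": Node("page", [Node("header"), Node("body")]),
    "letter": Node("page", [Node("address"), Node("salutation"), Node("body")]),
}

# Hypothetical L-S Tree of a testing document.
test_doc = Node("page", [Node("header"), Node("date"), Node("body"), Node("signature")])
reordered_doc = Node("page", [Node("body"), Node("header")])

print(classify(test_doc, TYPE_TREES))
print(classify(reordered_doc, TYPE_TREES))
```

In this sketch the first document is typed as a memo because `header` and `body` appear in that order among its blocks, while the reordered document matches no type because sibling order is significant in an ordered tree. The second stage described in the abstract (matching against Document Sample Trees to find the format) would repeat the same comparison against the sample trees of the chosen type.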