DEVELOPMENT OF DOCUMENT ARCHITECTURE EXTRACTION

Citation
M. Doi et al., DEVELOPMENT OF DOCUMENT ARCHITECTURE EXTRACTION, Systems and Computers in Japan, 25(9), 1994, pp. 67-82
Number of citations
9
Subject Categories
Computer Science Hardware & Architecture; Computer Science Information Systems; Computer Science Theory & Methods
ISSN journal
0882-1666
Volume
25
Issue
9
Year of publication
1994
Pages
67 - 82
Database
ISI
SICI code
0882-1666(1994)25:9<67:DODAE>2.0.ZU;2-1
Abstract
The purpose of this study is to reduce the burden of the document structurization process. A technique is presented for extracting the document architecture. As technical documents, 12,000 articles are extracted from the proceedings of a national convention. A summary of sample sentences, as well as approximately 500 office documents from within the organization, is also examined as business documents. From these, the rules for extracting the architecture are derived. The technique developed for document architecture extraction can extract hierarchical structures such as chapters and sections, as well as the reference structure to figures and tables, from the technical document. The technique can also extract hierarchical structures such as communications and reports from the business document. Technical and business documents can be discriminated by analyzing the character strings. In an evaluation using proceedings and in-office documents other than those used for deriving the rules, the error rate is 10.0 percent for the technical document and 23.0 percent for the business document. The error in extracting the reference structure is 8 percent. A field test is executed after improving the method so that equations, figures and tables embedded in the text can be handled. The error rate is then 5.4 percent for the technical document and 15.4 percent for the business document. It is verified through examples that the structurization can be achieved in a considerably shorter time than by manual processing. The developed document architecture extraction technique has been commercialized as an automatic system by combining the technique with the layout attribute. In addition to layout processing, the developed extraction technique is expected to be useful in the hypertext conversion of existing documents and in other problems.
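The abstract does not give the extraction rules themselves. As a rough illustration of what rule-based extraction from character strings can look like, the following is a minimal sketch that assumes English-style numbered headings ("1", "1.1", ...) and references written as "Fig. N", "Figure N" or "Table N". The patterns and the function name `extract_architecture` are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical rules, assumed for illustration only (not the paper's rules).
# A heading is a line that starts with a dotted section number and a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
# A reference is a mention of a numbered figure or table in running text.
REFERENCE = re.compile(r"\b(Fig\.|Figure|Table)\s*(\d+)")


def extract_architecture(lines):
    """Return (headings, references) found in a list of text lines.

    headings:   list of (depth, section_number, title)
    references: list of (enclosing_section_number, kind, target_number)
    """
    headings = []
    references = []
    current = None  # section number of the most recent heading, if any
    for line in lines:
        text = line.strip()
        match = HEADING.match(text)
        if match:
            number = match.group(1)
            depth = number.count(".") + 1  # "1" -> 1, "1.1" -> 2, ...
            current = number
            headings.append((depth, number, match.group(2)))
            continue
        for kind, num in REFERENCE.findall(text):
            label = "table" if kind == "Table" else "figure"
            references.append((current, label, int(num)))
    return headings, references


if __name__ == "__main__":
    sample = [
        "1 Introduction",
        "The overall flow is shown in Fig. 1.",
        "1.1 Background",
        "Table 2 lists the derived rules.",
        "2 Method",
    ]
    found_headings, found_references = extract_architecture(sample)
    for depth, number, title in found_headings:
        print("  " * (depth - 1) + number + " " + title)
    for section, kind, target in found_references:
        print(f"section {section} refers to {kind} {target}")
```

A sketch this simple misclassifies body lines that happen to begin with a number followed by a space, which is one plausible source of the kind of extraction errors the abstract reports; a practical system would need additional cues such as the layout attributes mentioned above.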