DEVELOPMENT OF DOCUMENT ARCHITECTURE EXTRACTION

Authors

DOI M; FUKUI M; TAMAGUCHI K; TAKEBYASHI Y; IWAI I

Citation

M. Doi et al., DEVELOPMENT OF DOCUMENT ARCHITECTURE EXTRACTION, Systems and computers in Japan, 25(9), 1994, pp. 67-82

Citations number

Subject categories

Computer Science Hardware & Architecture; Computer Science Information Systems; Computer Science Theory & Methods

Journal title

Systems and computers in Japan

ISSN journal

0882-1666

Volume

25

Issue

9

Year of publication

1994

Pages

67 - 82

Database

ISI

SICI code

0882-1666(1994)25:9<67:DODAE>2.0.ZU;2-1

Abstract

The purpose of this study is to reduce the burden of the document structuring process. A technique is presented for extracting the document architecture. As technical documents, 12,000 articles were taken from the proceedings of a national convention. As business documents, a summary of sample sentences and approximately 500 office documents from within the organization were also examined. From these, the rules for extracting the architecture were derived. The technique can extract hierarchical structures such as chapters and sections, as well as the reference structure to figures and tables, from technical documents. It can also extract hierarchical structures such as communications and reports from business documents. Technical and business documents can be discriminated by analyzing their character strings. In an evaluation using proceedings and in-office documents other than those used for deriving the rules, the error rate was 10.0 percent for technical documents and 23.0 percent for business documents. The error in extracting the reference structure was 8 percent. A field test was carried out after improving the method so that equations, figures and tables embedded in the text could be handled. The resulting error rate was 5.4 percent for technical documents and 15.4 percent for business documents. Examples verify that structuring can be achieved in considerably less time than by manual processing. The document architecture extraction technique has been commercialized as an automatic system by combining it with layout attributes. The technique is expected to be useful for converting existing documents to hypertext and for other problems, in addition to layout processing.
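
As an illustration only, the kind of rule-based extraction the abstract describes (a chapter/section hierarchy plus references to figures and tables) can be sketched in Python. The abstract does not give the paper's actual rules, so the patterns `HEADING` and `REFERENCE`, the function `extract_architecture`, and the sample lines below are all hypothetical, and they assume English text with decimal section numbering.

```python
import re

# Hypothetical rules: the abstract does not state the paper's actual patterns.
# A heading is a line starting with a decimal section number, e.g. "1.1 Background".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
# A reference is a mention such as "Fig. 1", "Figure 3" or "Table 2".
REFERENCE = re.compile(r"\b(Fig(?:ure)?|Table)\.?\s*(\d+)", re.IGNORECASE)


def extract_architecture(lines):
    """Return (headings, references) found in a list of text lines.

    headings:   list of (depth, section_number, title)
    references: list of (kind, number, line_index)
    """
    headings, references = [], []
    for index, line in enumerate(lines):
        match = HEADING.match(line.strip())
        if match:
            number, title = match.groups()
            # Depth follows from the numbering: "1" is 1, "1.1" is 2, and so on.
            depth = number.count(".") + 1
            headings.append((depth, number, title))
            continue
        for kind, ref_number in REFERENCE.findall(line):
            kind = "figure" if kind.lower().startswith("fig") else "table"
            references.append((kind, int(ref_number), index))
    return headings, references


sample = [
    "1 Introduction",
    "The overall flow is shown in Fig. 1.",
    "1.1 Background",
    "Table 2 lists the document types.",
]
headings, references = extract_architecture(sample)
```

A simple pattern like this will misread some body lines that happen to start with a number, which is one plausible source of the kind of error rates the abstract reports for rule-based extraction.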