The purpose of this study is to reduce the burden of the document
structuring process. A technique is presented for extracting the
document architecture. As technical documents, 12,000 articles were
taken from the proceedings of a national convention. A summary of
sample sentences, as well as approximately 500 office documents from
within the organization, were also examined as business documents.
From these, the rules for extracting the architecture were derived.
The technique can extract hierarchical structures such as chapters
and sections, as well as the reference structure to figures and
tables, from technical documents. It can also extract hierarchical
structures such as communications and reports from business
documents. Technical and business documents can be distinguished by
analyzing the character strings. In an evaluation using proceedings
and in-office documents other than those used for deriving the rules,
the error rate was 10.0 percent for technical documents and 23.0
percent for business documents. The error rate in extracting the
reference structure was 8 percent. A field test was conducted after
improving the method so that equations, figures, and tables embedded
in the text could be handled; the error rate was then 5.4 percent for
technical documents and 15.4 percent for business documents. Examples
confirmed that structuring can be achieved in considerably less time
than by manual processing. The technique has been commercialized as
an automatic system by combining it with layout attributes. In
addition to layout processing, the technique is expected to be useful
for converting existing documents to hypertext and for other
problems.
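
To make the idea of rule-based extraction from character strings
concrete, the following is a minimal sketch in Python. It is not the
rule set derived in the study, which the abstract does not give. The
two patterns (numbered headings such as "2.1 Title" and references
such as "Fig. 1" or "Table 2") and the function name
extract_architecture are assumptions chosen for illustration only.

```python
import re

# Hypothetical rules, assumed for illustration; the study's actual
# rules are not stated in the abstract.
# A heading is a dotted section number followed by a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
# A reference is "Fig.", "Figure", or "Table" followed by a number.
REFERENCE = re.compile(r"\b(Fig\.|Figure|Table)\s*(\d+)")


def extract_architecture(lines):
    """Return (headings, references) found in a list of text lines.

    headings:   list of (depth, section_number, title)
    references: list of (enclosing_section_number, kind, target_number)
    """
    headings = []
    references = []
    current = None  # section number of the most recent heading
    for line in lines:
        text = line.strip()
        match = HEADING.match(text)
        if match:
            number, title = match.group(1), match.group(2)
            # Depth is the number of dotted components: "2.1" -> 2.
            depth = number.count(".") + 1
            current = number
            headings.append((depth, number, title))
            continue
        # Body line: record every figure or table it refers to.
        for kind, num in REFERENCE.findall(text):
            label = "figure" if kind.startswith("Fig") else "table"
            references.append((current, label, int(num)))
    return headings, references
```

Calling extract_architecture on the lines of a document yields the
section hierarchy and, for each figure or table reference, the section
in which it occurs. A sketch this simple will misclassify body lines
that happen to begin with a number, which is the kind of case a
rule set derived from real documents would have to handle.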