This paper presents a modular software system that classifies a large
variety of office documents according to layout form and textual
content. It consists of the following components: layout analysis,
pre-classification, an OCR interface, fuzzy string matching, text
categorization, and lexical, syntactic, and semantic analysis. The
system has been applied to the following tasks: presorting of forms,
reports, and letters; index extraction for archiving and retrieval;
page type classification and text column analysis of real estate
register documents; and in-house mail sorting and electronic
distribution to departments. The architecture, modules, and practical
results are described. (C) 1997 Elsevier Science B.V.