T. Bayer et al., CATEGORIZING PAPER DOCUMENTS - A GENERIC SYSTEM FOR DOMAIN AND LANGUAGE INDEPENDENT TEXT CATEGORIZATION, Computer vision and image understanding, 70(3), 1998, pp. 299-306
Text categorization assigns predefined categories to either
electronically available texts or those resulting from document image
analysis. A generic system for text categorization is presented which
is based on statistical analysis of representative text corpora.
Significant features are automatically derived from training texts by
selecting substrings from actual word forms and applying statistical
information and general linguistic knowledge. The dimension of the
feature vectors is then reduced by linear transformation, keeping the
essential information. The classification is a minimum least-squares
approach based on polynomials. The described system can be efficiently
adapted to new domains or different languages. In application, the
adapted text categorizers are reliable, fast, and completely automatic.
Two example categorization tasks achieve recognition scores of
approximately 80% and are very robust against recognition or typing
errors. (C) 1999 Academic Press.
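
The three stages named in the abstract (substring features, linear
dimension reduction, polynomial least-squares classification) can be
sketched roughly as follows. This is an illustration only, not the
authors' implementation: the abstract does not give the substring
length, the feature selection rule, the transformation, or the
polynomial degree. The sketch assumes character trigrams with no
statistical selection, a PCA projection standing in for the unspecified
linear transformation, and a quadratic polynomial; all function names
and the toy texts are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement

def substrings(text, n=3):
    """Character n-grams of each word form, padded with '_' (assumed feature type)."""
    grams = []
    for word in text.lower().split():
        padded = f"_{word}_"
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

def build_vocab(texts, n=3):
    """Map every n-gram seen in the training texts to a column index."""
    vocab = sorted({g for t in texts for g in substrings(t, n)})
    return {g: i for i, g in enumerate(vocab)}

def vectorize(texts, vocab, n=3):
    """Count known n-grams per text; unknown n-grams are ignored."""
    X = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for g in substrings(t, n):
            if g in vocab:
                X[row, vocab[g]] += 1.0
    return X

def fit_projection(X, k):
    """PCA via SVD: an assumed stand-in for the paper's linear transformation."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def poly_expand(Z):
    """Constant, linear, and quadratic terms of the reduced features."""
    d = Z.shape[1]
    cols = [np.ones(len(Z))]
    cols += [Z[:, i] for i in range(d)]
    cols += [Z[:, i] * Z[:, j]
             for i, j in combinations_with_replacement(range(d), 2)]
    return np.column_stack(cols)

def train(texts, labels, k=2):
    """Fit polynomial coefficients by least squares against one-hot targets."""
    classes = sorted(set(labels))
    vocab = build_vocab(texts)
    X = vectorize(texts, vocab)
    mean, P = fit_projection(X, k)
    Phi = poly_expand((X - mean) @ P)
    Y = np.array([[1.0 if lab == c else 0.0 for c in classes]
                  for lab in labels])
    A, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return {"vocab": vocab, "mean": mean, "P": P, "A": A, "classes": classes}

def classify(model, text):
    """Return the class whose polynomial score is largest."""
    X = vectorize([text], model["vocab"])
    Phi = poly_expand((X - model["mean"]) @ model["P"])
    scores = Phi @ model["A"]
    return model["classes"][int(np.argmax(scores))]

# Toy training set (hypothetical, for illustration only).
texts = [
    "invoice payment amount due",
    "payment invoice total amount",
    "complaint broken product refund",
    "refund complaint damaged product",
]
labels = ["invoice", "invoice", "complaint", "complaint"]
model = train(texts, labels, k=2)
print(classify(model, "invoice payment amount due"))
```

Because the features are substrings rather than whole words, a
misrecognized or mistyped character changes only a few n-gram counts,
which is one plausible reading of the robustness claim in the abstract.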