CATEGORIZING PAPER DOCUMENTS - A GENERIC SYSTEM FOR DOMAIN AND LANGUAGE INDEPENDENT TEXT CATEGORIZATION

Authors

BAYER T KRESSEL U MOGGSCHNEIDER H RENZ I

Citation

T. Bayer et al., CATEGORIZING PAPER DOCUMENTS - A GENERIC SYSTEM FOR DOMAIN AND LANGUAGE INDEPENDENT TEXT CATEGORIZATION, Computer vision and image understanding, 70(3), 1998, pp. 299-306

Citations number

Subject Categories

Computer Science, Software, Graphics, Programming

Journal title

Computer vision and image understanding

ISSN journal

1077-3142

Volume

70

Issue

3

Year of publication

1998

Pages

299 - 306

Database

ISI

SICI code

1077-3142(1998)70:3<299:CPD-AG>2.0.ZU;2-U

Abstract

Text categorization assigns predefined categories to either electronically available texts or those resulting from document image analysis. A generic system for text categorization is presented which is based on statistical analysis of representative text corpora. Significant features are automatically derived from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge. The dimension of the feature vectors is then reduced by linear transformation, keeping the essential information. The classification is a minimum least-squares approach based on polynomials. The described system can be efficiently adapted to new domains or different languages. In application, the adapted text categorizers are reliable, fast, and completely automatic. Two example categorization tasks achieve recognition scores of approximately 80% and are very robust against recognition or typing errors. (C) 1999 Academic Press.
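
Illustrative sketch (not part of the original record)

The pipeline outlined in the abstract (substring features, linear dimension reduction, polynomial least-squares classification) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how substrings are selected, which linear transformation is used, or what polynomial degree is chosen. Here character trigrams stand in for the substring features, PCA via SVD stands in for the linear transformation, and a quadratic polynomial is assumed. All names (`ngram_counts`, `Categorizer`, the toy training texts) are hypothetical, and NumPy is assumed to be installed.

```python
from itertools import combinations_with_replacement

import numpy as np


def ngram_counts(text, n=3):
    """Count character n-grams inside each word, with '_' marking word boundaries."""
    counts = {}
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            counts[gram] = counts.get(gram, 0) + 1
    return counts


def vectorize(texts, vocab):
    """Map texts to count vectors over a fixed n-gram vocabulary."""
    X = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for gram, count in ngram_counts(text).items():
            if gram in vocab:  # n-grams unseen in training are ignored
                X[row, vocab[gram]] = count
    return X


def poly_expand(Z):
    """Expand reduced features into constant, linear, and quadratic terms."""
    dims = range(Z.shape[1])
    cols = [np.ones(len(Z))]
    cols += [Z[:, i] for i in dims]
    cols += [Z[:, i] * Z[:, j] for i, j in combinations_with_replacement(dims, 2)]
    return np.column_stack(cols)


class Categorizer:
    def fit(self, texts, labels, k=2):
        grams = sorted({g for t in texts for g in ngram_counts(t)})
        self.vocab = {g: i for i, g in enumerate(grams)}
        self.classes = sorted(set(labels))
        X = vectorize(texts, self.vocab)
        self.mean = X.mean(axis=0)
        # Assumption: PCA (top-k right singular vectors) as the linear reduction.
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.proj = Vt[:k].T
        P = poly_expand((X - self.mean) @ self.proj)
        # One-hot class targets; least squares fits one polynomial per class.
        Y = np.array([[float(lab == c) for c in self.classes] for lab in labels])
        self.W, *_ = np.linalg.lstsq(P, Y, rcond=None)
        return self

    def predict(self, text):
        X = vectorize([text], self.vocab)
        P = poly_expand((X - self.mean) @ self.proj)
        return self.classes[int(np.argmax(P @ self.W))]


# Toy training data (invented for illustration only).
train_texts = [
    "invoice payment amount due",
    "payment invoice total amount",
    "meeting agenda schedule room",
    "schedule meeting room agenda",
]
train_labels = ["invoice", "invoice", "meeting", "meeting"]
model = Categorizer().fit(train_texts, train_labels)

# Misspelled input, loosely imitating OCR or typing errors.
print(model.predict("invoce paymnt amount"))
```

Because features are substrings rather than whole words, a misspelled word still shares most of its trigrams with the correct form, which is one plausible reading of why the abstract reports robustness against recognition or typing errors.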