CATEGORIZING PAPER DOCUMENTS - A GENERIC SYSTEM FOR DOMAIN AND LANGUAGE INDEPENDENT TEXT CATEGORIZATION

Citation
T. Bayer et al., CATEGORIZING PAPER DOCUMENTS - A GENERIC SYSTEM FOR DOMAIN AND LANGUAGE INDEPENDENT TEXT CATEGORIZATION, Computer vision and image understanding, 70(3), 1998, pp. 299-306
Number of citations
20
Subject categories
Computer Science, Software, Graphics, Programming
ISSN journal
10773142
Volume
70
Issue
3
Year of publication
1998
Pages
299 - 306
Database
ISI
SICI code
1077-3142(1998)70:3<299:CPD-AG>2.0.ZU;2-U
Abstract
Text categorization assigns predefined categories to either electronically available texts or those resulting from document image analysis. A generic system for text categorization is presented which is based on statistical analysis of representative text corpora. Significant features are automatically derived from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge. The dimension of the feature vectors is then reduced by linear transformation, keeping the essential information. The classification is a minimum least-squares approach based on polynomials. The described system can be efficiently adapted to new domains or different languages. In application, the adapted text categorizers are reliable, fast, and completely automatic. Two example categorization tasks achieve recognition scores of approximately 80% and are very robust against recognition or typing errors. (C) 1999 Academic Press.
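
Illustrative sketch
The pipeline outlined in the abstract (substring features taken from word forms, followed by a least-squares classifier) can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes character trigrams as the substrings, a first-degree polynomial (linear) discriminant fitted by ridge-regularized least squares, and it skips the linear dimension-reduction step entirely. All function names, the toy texts, and the category labels are hypothetical.

```python
from collections import Counter

def substrings(text, n=3):
    """Character n-grams from each word form, padded with '_' (assumed feature type)."""
    grams = []
    for word in text.lower().split():
        padded = "_" + word + "_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def build_vocab(texts, size=100):
    """Keep the most frequent substrings seen in the training texts."""
    counts = Counter(g for t in texts for g in substrings(t))
    return [g for g, _ in counts.most_common(size)]

def featurize(text, vocab):
    """Feature vector: a constant bias term followed by substring counts."""
    counts = Counter(substrings(text))
    return [1.0] + [float(counts[g]) for g in vocab]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def train(texts, labels, vocab, ridge=1e-3):
    """Fit one least-squares discriminant per category (target 1 for own class, else 0)."""
    xs = [featurize(t, vocab) for t in texts]
    d = len(xs[0])
    # Normal equations (X^T X + ridge * I) w = X^T y; the ridge keeps the system solvable.
    gram = [[sum(x[i] * x[j] for x in xs) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    weights = {}
    for c in sorted(set(labels)):
        rhs = [sum(x[i] * (1.0 if y == c else 0.0) for x, y in zip(xs, labels))
               for i in range(d)]
        weights[c] = solve(gram, rhs)
    return weights

def classify(text, vocab, weights):
    """Return the category whose discriminant gives the highest score."""
    x = featurize(text, vocab)
    return max(weights, key=lambda c: sum(w * v for w, v in zip(weights[c], x)))

# Toy training data (hypothetical categories and texts).
texts = ["invoice payment due amount", "payment invoice total amount",
         "meeting schedule agenda room", "agenda meeting room schedule"]
labels = ["finance", "finance", "meeting", "meeting"]
vocab = build_vocab(texts)
weights = train(texts, labels, vocab)
print(classify("invoise paymnt", vocab, weights))
print(classify("meetng agenda", vocab, weights))
```

Because the features are substrings rather than whole words, a misspelled or misrecognized word still shares most of its trigrams with the correct form. This is one plausible reading of the robustness against recognition or typing errors that the abstract reports.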