USING CORPUS STATISTICS TO REMOVE REDUNDANT WORDS IN TEXT CATEGORIZATION

Authors
Ym. Yang, J. Wilbur
Citation
Ym. Yang and J. Wilbur, USING CORPUS STATISTICS TO REMOVE REDUNDANT WORDS IN TEXT CATEGORIZATION, Journal of the American Society for Information Science, 47(5), 1996, pp. 357-369
Number of citations
28
Subject Categories
Information Science & Library Science
ISSN journal
0002-8231
Volume
47
Issue
5
Year of publication
1996
Pages
357 - 369
Database
ISI
SICI code
0002-8231(1996)47:5<357:UCSTRR>2.0.ZU;2-0
Abstract
This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with three categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time savings and a 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases.
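
The abstract does not say which corpus statistic the authors use to identify stop words. As an illustration only, the sketch below builds a domain-specific stoplist with a simple document-frequency threshold. This is a stand-in chosen for brevity, not the method of the paper. The function names `build_stoplist` and `removal_rate`, the `max_doc_fraction` parameter, and the toy documents are all hypothetical. The last function checks the vocabulary-reduction arithmetic quoted in the abstract.

```python
from collections import Counter


def build_stoplist(docs, max_doc_fraction=0.5):
    """Return words that occur in more than max_doc_fraction of the documents.

    Assumption: document frequency is used here as a simple corpus statistic;
    the abstract does not specify the statistic the authors actually use.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency, not term frequency).
        doc_freq.update(set(doc.lower().split()))
    return {word for word, df in doc_freq.items() if df / n_docs > max_doc_fraction}


def removal_rate(vocab_before, vocab_after):
    """Fraction of the vocabulary removed."""
    return (vocab_before - vocab_after) / vocab_before


# Toy corpus (hypothetical), standing in for a domain-specific collection.
docs = [
    "the parser reads the input file",
    "the compiler emits the object file",
    "a scheduler runs the queued jobs",
]

stoplist = build_stoplist(docs)
vocabulary = {word for doc in docs for word in doc.lower().split()}
reduced_vocabulary = vocabulary - stoplist

print(sorted(stoplist))
print(len(vocabulary), "->", len(reduced_vocabulary))

# Figures from the abstract: 8,002 distinct words reduced to 1,045.
print(f"{removal_rate(8002, 1045):.1%}")
```

With the figures from the abstract, (8,002 - 1,045) / 8,002 is about 0.869, which is consistent with the reported 87% removal of unique words.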