Ym. Yang and J. Wilbur, USING CORPUS STATISTICS TO REMOVE REDUNDANT WORDS IN TEXT CATEGORIZATION, Journal of the American Society for Information Science, 47(5), 1996, pp. 357-369
Number of citations
28
Subject Categories
Information Science & Library Science
This article studies aggressive word removal in text categorization to
reduce the noise in free texts and to enhance the computational
efficiency of categorization. We use a novel stop word identification
method to automatically generate domain specific stoplists which are
much larger than a conventional domain-independent stoplist. In our
tests with three categorization methods on text collections from
different domains/applications, significant numbers of words were
removed without sacrificing categorization effectiveness. In the test
of the Expert Network method on CACM documents, for example, an 87%
removal of unique words reduced the vocabulary of documents from 8,002
distinct words to 1,045 words, which resulted in a 63% time savings and
a 74% memory savings in the computation of category ranking, with a 10%
precision improvement on average over not using word removal. It is
evident in this study that automated word removal based on corpus
statistics has a practical and significant impact on the computational
tractability of categorization methods in large databases.
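
The abstract does not describe the stop word identification method itself, so the following is only a minimal sketch of the general idea of deriving a domain-specific stoplist from corpus statistics. It assumes a simple document-frequency rule as a stand-in: a word is dropped if it appears in too many or too few documents. The function names, the thresholds, and the toy corpus are all hypothetical and are not taken from the article. The last helper simply checks the vocabulary reduction quoted above (8,002 to 1,045 distinct words).

```python
from collections import Counter


def build_stoplist(docs, max_doc_ratio=0.5, min_doc_count=2):
    """Return words judged uninformative by a document-frequency rule.

    Assumption: a word is treated as a stop word if it occurs in more
    than max_doc_ratio of the documents (too common) or in fewer than
    min_doc_count documents (too rare). This is an illustrative
    statistic, not the identification method used in the article.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document.
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    return {
        word
        for word, df in doc_freq.items()
        if df / n_docs > max_doc_ratio or df < min_doc_count
    }


def remove_words(doc, stoplist):
    """Tokenize on whitespace and drop every word in the stoplist."""
    return [w for w in doc.lower().split() if w not in stoplist]


def vocabulary_reduction(before, after):
    """Fraction of distinct words removed from the vocabulary."""
    return 1.0 - after / before


# Hypothetical toy corpus, only for demonstration.
docs = [
    "the parser reads the grammar",
    "the compiler calls the parser",
    "the grammar defines the language",
    "a compiler translates the language",
]
stoplist = build_stoplist(docs)
filtered = remove_words(docs[0], stoplist)

# Reduction reported in the abstract: 8,002 -> 1,045 distinct words.
reported = vocabulary_reduction(8002, 1045)
```

With the toy corpus, the stoplist contains both the ubiquitous word "the" and the words that occur in only one document, which illustrates why a corpus-derived stoplist can be much larger than a generic one. The value of `reported` rounds to 0.87, matching the 87% removal stated in the abstract.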