Corpus-based statistical screening for content-bearing terms

Authors
Citation
W. Kim and W.J. Wilbur, Corpus-based statistical screening for content-bearing terms, J AM SOC IN, 52(3), 2001, pp. 247-259
Number of citations
33
Subject Categories
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
ISSN journal
1532-2882
Volume
52
Issue
3
Year of publication
2001
Pages
247 - 259
Database
ISI
SICI code
1532-2882(20010201)52:3<247:CSSFCT>2.0.ZU;2-A
Abstract
An important problem in the indexing of natural language text is how to identify those words and phrases that reflect the content of the text. In general, automatic indexing has dealt with this problem by removing instances of a few hundred common words known as stop words, and treating the remaining words as though they were content bearing. This approach is acceptable for some applications such as statistical estimates of the similarity of queries and documents for the purpose of document retrieval. However, when the indexing terms are to be examined by a human as a means of accessing the literature, it greatly improves efficiency if most of the noncontent-bearing words and phrases can be eliminated from the indexing. Here we present three statistical techniques for identifying content-bearing phrases within a natural language database. We demonstrate the effectiveness of the methods on test data, and show how all three methods can be combined to produce a single improved method.