
Corpus-based statistical screening for content-bearing terms

Authors

Kim, W; Wilbur, WJ

Citation

W. Kim and W.J. Wilbur, Corpus-based statistical screening for content-bearing terms, J AM SOC IN, 52(3), 2001, pp. 247-259

Citations number

Subject Categories

Library & Information Science

Journal title

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY

ISSN journal

15322882

Volume

52

Issue

3

Year of publication

2001

Pages

247 - 259

Database

ISI

SICI code

1532-2882(20010201)52:3<247:CSSFCT>2.0.ZU;2-A

Abstract

An important problem in the indexing of natural language text is how to identify those words and phrases that reflect the content of the text. In general, automatic indexing has dealt with this problem by removing instances of a few hundred common words known as stop words, and treating the remaining words as though they were content bearing. This approach is acceptable for some applications such as statistical estimates of the similarity of queries and documents for the purpose of document retrieval. However, when the indexing terms are to be examined by a human as a means of accessing the literature, it greatly improves efficiency if most of the noncontent-bearing words and phrases can be eliminated from the indexing. Here we present three statistical techniques for identifying content-bearing phrases within a natural language database. We demonstrate the effectiveness of the methods on test data, and show how all three methods can be combined to produce a single improved method.