A model for estimating the occurrence of same-frequency words and the boundary between high- and low-frequency words in texts

Authors

Sun, QL; Shaw, D; Davis, CH

Citation

Q.L. Sun et al., A model for estimating the occurrence of same-frequency words and the boundary between high- and low-frequency words in texts, J AM S INFO, 50(3), 1999, pp. 280-286

Subject Categories

Library & Information Science

Journal title

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE

ISSN journal

0002-8231

Volume

50

Issue

3

Year of publication

1999

Pages

280 - 286

Database

ISI

SICI code

0002-8231(199903)50:3<280:AMFETO>2.0.ZU;2-5

Abstract

A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency words and low-frequency words in a text. The model, based on a "maximum ranking method," assigns ranks to the words and estimates word frequency by the formula: $\mathrm{Int}[(-1 + (1 + 4D/I_{n+1})^{1/2})/2] > n \ge \mathrm{Int}[(-1 + (1 + 4D/I_n)^{1/2})/2]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^* = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words are dependent only on the vocabulary of a text (the number of different words) but not on its length. Like Zipf's Law, the model may be universally applicable.
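The two quantities named in the abstract can be illustrated with a minimal Python sketch. It assumes that D is the number of different words in the text and reads I_n as the number of different words that occur exactly n times; the abstract does not define its symbols, so that reading is an assumption. The function names `rank_bound`, `boundary`, and `summarize`, and the whitespace tokenization, are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def rank_bound(distinct_words: int, same_freq_count: int) -> int:
    """Evaluate Int[(-1 + (1 + 4D/I)^(1/2)) / 2] for a given D and I.

    Assumption: I is the number of different words sharing one frequency.
    """
    return int((-1 + math.sqrt(1 + 4 * distinct_words / same_freq_count)) / 2)


def boundary(distinct_words: int) -> float:
    """Boundary between high- and low-frequency words: n* = D^(1/2)."""
    return math.sqrt(distinct_words)


def summarize(text: str):
    """Return (D, {n: I_n}, n*) for a whitespace-tokenized text.

    Assumption: lowercased whitespace tokens stand in for "words".
    """
    word_freq = Counter(text.lower().split())
    distinct_words = len(word_freq)            # D
    same_freq = Counter(word_freq.values())    # n -> I_n
    return distinct_words, same_freq, boundary(distinct_words)


if __name__ == "__main__":
    d, same_freq, n_star = summarize("the the the cat cat sat on mat by")
    print("D =", d)
    print("I_n by n:", dict(same_freq))
    print("n* =", round(n_star, 3))
```

Under this reading, words whose frequency exceeds n* would count as high-frequency and the rest as low-frequency. Real texts need a proper tokenizer, and Chinese text needs word segmentation, which this sketch does not attempt.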