STATISTICAL-MODELS FOR WORD-FREQUENCY DISTRIBUTIONS - A LINGUISTIC EVALUATION

Authors

BAAYEN H

Citation

H. Baayen, STATISTICAL-MODELS FOR WORD-FREQUENCY DISTRIBUTIONS - A LINGUISTIC EVALUATION, Computers and the humanities, 26(5-6), 1992, pp. 347-363

Subject categories

Art & Humanities General; Computer Sciences, Special Topics; Computer Applications & Cybernetics

Journal title

Computers and the humanities

ISSN journal

00104817

Volume

26

Issue

5-6

Year of publication

1992

Pages

347 - 363

Database

ISI

SICI code

0010-4817(1992)26:5-6<347:SFWD-A>2.0.ZU;2-Z

Abstract

Three models for word frequency distributions, the lognormal law, the generalized inverse Gauss-Poisson law and the extended generalized Zipf's law, are compared and evaluated with respect to goodness of fit and rationale. Application of these models to frequency distributions of a text, a corpus and morphological data reveals that no model can lay claim to exclusive validity, while inspection of the extrapolated theoretical vocabulary sizes raises doubts as to whether the urn scheme with independent trials is the correct underlying model for word frequency data. The role of morphology in shaping word frequency distributions is discussed, as well as parallelisms between vocabulary richness in literary studies and morphological productivity in linguistics.