STATISTICAL-MODELS FOR WORD-FREQUENCY DISTRIBUTIONS - A LINGUISTIC EVALUATION

Authors
Baayen, H.
Citation
H. Baayen, STATISTICAL-MODELS FOR WORD-FREQUENCY DISTRIBUTIONS - A LINGUISTIC EVALUATION, Computers and the Humanities, 26(5-6), 1992, pp. 347-363
Number of citations
51
Subject Categories
Art & Humanities General; Computer Sciences, Special Topics; Computer Applications & Cybernetics
Journal ISSN
0010-4817
Volume
26
Issue
5-6
Year of publication
1992
Pages
347 - 363
Database
ISI
SICI code
0010-4817(1992)26:5-6<347:SFWD-A>2.0.ZU;2-Z
Abstract
Three models for word frequency distributions, the lognormal law, the generalized inverse Gauss-Poisson law, and the extended generalized Zipf's law, are compared and evaluated with respect to goodness of fit and rationale. Application of these models to frequency distributions of a text, a corpus, and morphological data reveals that no model can lay claim to exclusive validity, while inspection of the extrapolated theoretical vocabulary sizes raises doubts as to whether the urn scheme with independent trials is the correct underlying model for word frequency data. The role of morphology in shaping word frequency distributions is discussed, as well as parallelisms between vocabulary richness in literary studies and morphological productivity in linguistics.