ENTROPY OF NATURAL LANGUAGES - THEORY AND EXPERIMENT

Citation
L. B. Levitin and Z. Reingold, ENTROPY OF NATURAL LANGUAGES - THEORY AND EXPERIMENT, Chaos, Solitons and Fractals, 4(5), 1994, pp. 709-743
Number of citations
50
Subject Categories
Mathematics, Mechanics, Engineering, "Physics, Applied"
ISSN journal
09600779
Volume
4
Issue
5
Year of publication
1994
Pages
709 - 743
Database
ISI
SICI code
0960-0779(1994)4:5<709:EONL-T>2.0.ZU;2-T
Abstract
The concept of the entropy of natural languages, first introduced by Shannon [A mathematical theory of communications, Bell Syst. Tech. J. 27, 379-423 (1948)], and its significance are discussed. A review of various known approaches to and results of previous studies of language entropy is presented. A new improved method for evaluation of both lower and upper bounds of the entropy of printed texts is developed. This method is a refinement of Shannon's prediction (guessing) method [Shannon, Prediction and entropy of printed English, Bell Syst. Tech. J. 30, 50-64 (1951)]. The evaluation of the lower bound is shown to be a classical linear programming problem. Statistical analysis of the estimation of the bounds is given and procedures for the statistical treatment of the experimental data (including verification of statistical validity and significance) are elaborated. The method has been applied to printed Hebrew texts in a large experiment (1000 independent samples) in order to evaluate entropy and other information-theoretical characteristics of the Hebrew language. The results have demonstrated the efficiency of the new method: the gap between the upper and lower bounds of entropy has been reduced by a factor of 2.25 compared to the original Shannon approach. Comparison with other languages is given. Possible applications of the method are briefly discussed.
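
For orientation, the bounds of the original Shannon (1951) guessing method, which the abstract takes as its baseline, can be sketched as follows. This is a minimal illustration of the classical formulas only, not the refined method or the linear programming formulation of the paper. The function name shannon_bounds and the input list q are hypothetical; q[i] is assumed to be the observed relative frequency with which a subject names the next letter correctly on guess number i + 1.

```python
import math

def shannon_bounds(q):
    """Return (lower, upper) entropy bounds in bits per letter.

    q is a list of guess-rank frequencies summing to 1 (an assumed input
    format). The frequencies are sorted in descending order because the
    classical bounds assume an ideal predictor, for which they are
    non-increasing.
    """
    q = sorted(q, reverse=True)
    # Upper bound: entropy of the guess-rank distribution.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over i of i * (q_i - q_{i+1}) * log2(i), with q_{n+1} = 0.
    padded = q + [0.0]
    lower = sum(
        (i + 1) * (padded[i] - padded[i + 1]) * math.log2(i + 1)
        for i in range(len(q))
    )
    return lower, upper
```

With a uniform distribution over four guesses, shannon_bounds([0.25, 0.25, 0.25, 0.25]) gives coinciding bounds of 2 bits; for skewed distributions the two values separate, and it is this gap that the abstract reports as reduced by a factor of 2.25.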