ESTIMATING THE ENTROPY OF DNA-SEQUENCES

Citation
A. O. Schmitt and H. Herzel, Estimating the Entropy of DNA-Sequences, Journal of Theoretical Biology, 188(3), 1997, pp. 369-377
Number of citations
25
Subject Categories
Biology Miscellaneous
ISSN journal
0022-5193
Volume
188
Issue
3
Year of publication
1997
Pages
369 - 377
Database
ISI
SICI code
0022-5193(1997)188:3<369:ETEOD>2.0.ZU;2-I
Abstract
The Shannon entropy is a standard measure for the order state of symbol sequences, such as, for example, DNA sequences. In order to incorporate correlations between symbols, the entropy of n-mers (consecutive strands of n symbols) has to be determined. Here, an assay is presented to estimate such higher order entropies (block entropies) for DNA sequences when the actual number of observations is small compared with the number of possible outcomes. The n-mer probability distribution underlying the dynamical process is reconstructed using elementary statistical principles: the theorem of asymptotic equi-distribution and the Maximum Entropy Principle. Constraints are set to force the constructed distributions to adopt features which are characteristic of the real probability distribution. From the many solutions compatible with these constraints, the one with the highest entropy is the most likely one according to the Maximum Entropy Principle. An algorithm performing this procedure is expounded. It is tested by applying it to various DNA model sequences whose exact entropies are known. Finally, results for a real DNA sequence, the complete genome of the Epstein-Barr virus, are presented and compared with those of other information carriers (texts, computer source code, music). It seems as if DNA sequences possess much more freedom in the combination of the symbols of their alphabet than written language or computer source code. (C) 1997 Academic Press Limited.
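For orientation, the sketch below shows the naive plug-in estimate of the n-mer block entropy that the abstract refers to, computed on a toy DNA string. The sequence and function name are illustrative, and the paper's actual contribution, the Maximum Entropy reconstruction that corrects this estimator's finite-sample bias, is not reproduced here.

```python
from collections import Counter
from math import log2


def block_entropy(seq: str, n: int) -> float:
    """Naive (plug-in) estimate of the n-mer block entropy H_n in bits.

    Counts all overlapping n-mers and applies H_n = -sum_i p_i log2 p_i.
    This simple estimator is biased downward when the number of observed
    n-mers is small compared with the 4**n possible outcomes, which is
    exactly the regime the paper's Maximum Entropy procedure addresses.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy sequence for illustration only (hypothetical data).
    dna = "ACGTACGTTGCAACGTGGCAATTCCG"
    for n in (1, 2, 3):
        print(f"H_{n} = {block_entropy(dna, n):.3f} bits")
```

For a long, uncorrelated sequence over the four-letter alphabet, H_n approaches 2n bits; systematic deviations below that line reflect correlations between symbols, which is what the block entropies are meant to capture.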