Information content of protein sequences

Citation
O. Weiss et al., Information content of protein sequences, J THEOR BIO, 206(3), 2000, pp. 379-386
Number of citations
41
Subject categories
Multidisciplinary
Journal title
JOURNAL OF THEORETICAL BIOLOGY
ISSN journal
0022-5193
Volume
206
Issue
3
Year of publication
2000
Pages
379 - 386
Database
ISI
SICI code
0022-5193(20001007)206:3<379:ICOPS>2.0.ZU;2-S
Abstract
The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimations of the entropy of the source are not possible due to finite sample effects. Compression algorithms also indicate that the redundancy is in the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the redundancy observed. The findings are related to numerical and biochemical experiments with random polypeptides. (C) 2000 Academic Press.
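
The kind of measurement the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, `zlib` is an assumed stand-in for whichever compression algorithms the paper used, and the input is a synthetic toy string rather than real protein data. It computes a plug-in block entropy, a conditional entropy from block-entropy differences, a compression-based bits-per-residue figure, and a shuffled surrogate that keeps the amino-acid composition while destroying correlations.

```python
import math
import random
import zlib
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def block_entropy(seq, k=1):
    """Plug-in Shannon entropy (in bits) of overlapping blocks of length k."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(seq, k):
    """Estimate of the entropy of one residue given the k - 1 preceding ones.

    Computed as H_k - H_(k-1), with H_0 taken to be 0. For larger k the
    plug-in estimate is biased downward on finite samples.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if k == 1:
        return block_entropy(seq, 1)
    return block_entropy(seq, k) - block_entropy(seq, k - 1)


def compressed_bits_per_symbol(seq):
    """Bits per residue after zlib compression (assumed stand-in compressor)."""
    data = seq.encode("ascii")
    return 8 * len(zlib.compress(data, 9)) / len(data)


def shuffled_surrogate(seq, rng):
    """Random permutation of seq: same composition, no sequential correlations."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic toy input, NOT real protein data.
    toy = "".join(rng.choice(AMINO_ACIDS) for _ in range(5000))
    surrogate = shuffled_surrogate(toy, rng)
    for name, s in (("toy", toy), ("surrogate", surrogate)):
        print(name,
              round(block_entropy(s, 1), 3),
              round(conditional_entropy(s, 2), 3),
              round(compressed_bits_per_symbol(s), 3))
```

Comparing the figures for a sequence set with those for its shuffled surrogate gives a rough sense of how much redundancy is attributable to correlations rather than to composition alone. A general-purpose compressor such as `zlib` carries overhead of its own, so its output is at best a loose upper bound on the entropy.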