The complexity of large sets of non-redundant protein sequences is measured
. This is done by estimating the Shannon entropy as well as applying compre
ssion algorithms to estimate the algorithmic complexity. The estimators are
also applied to randomly generated surrogates of the protein data. Our res
ults show that proteins are fairly close to random sequences. The entropy r
eduction due to correlations is only about 1%. However, precise estimations
of the entropy of the source are not possible due to finite sample effects
. Compression algorithms also indicate that the redundancy is in the order
of 1%. These results confirm the idea that protein sequences can be regarde
d as slightly edited random strings. We discuss secondary structure and low
-complexity regions as causes of the redundancy observed. The findings are
related to numerical and biochemical experiments with random polypeptides.
(C) 2000 Academic Press.