J.C. Wootton and S. Federhen, STATISTICS OF LOCAL COMPLEXITY IN AMINO-ACID-SEQUENCES AND SEQUENCE DATABASES, Computers & Chemistry, 17(2), 1993, pp. 149-163
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are: (1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the chi-squared statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
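
As an illustration of definitions (1) and (2) and of first-pass localization, the following Python sketch scores fixed-length windows of a sequence. It is not the SEG implementation. The abstract gives no formulas, so the two measures below are assumed forms: the Shannon entropy of the window composition for (2), and the normalized logarithm of the number of distinct sequences sharing that composition for (1). The function names, the 20-letter alphabet, the window length of 12 and the threshold of 2.2 bits are all illustrative assumptions.

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # assumption: the 20 standard amino acids


def entropy_complexity(window: str) -> float:
    """Shannon entropy of the window's residue composition, in bits per residue.

    Assumed form of the entropy-like definition (2).
    """
    length = len(window)
    counts = Counter(window)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def enumeration_complexity(window: str, alphabet_size: int = ALPHABET_SIZE) -> float:
    """(1/L) * log_N(L! / prod(n_i!)) for a window of length L.

    Assumed form of the enumeration-based definition (1): the multinomial
    coefficient counts the distinct sequences with the same composition.
    lgamma(n + 1) equals ln(n!) and avoids overflow for long windows.
    """
    length = len(window)
    counts = Counter(window)
    log_omega = math.lgamma(length + 1) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )
    return log_omega / (length * math.log(alphabet_size))


def sliding_complexity(sequence: str, window_length: int = 12,
                       measure=entropy_complexity) -> list:
    """Score every full-length window of the sequence with the given measure."""
    return [
        measure(sequence[start:start + window_length])
        for start in range(len(sequence) - window_length + 1)
    ]


def low_complexity_positions(sequence: str, window_length: int = 12,
                             threshold: float = 2.2,
                             measure=entropy_complexity) -> list:
    """Start indices of windows whose score is at or below the threshold.

    This is only a first-pass localization; it does not perform the
    optimal segmentation step that the abstract attributes to SEG.
    """
    scores = sliding_complexity(sequence, window_length, measure)
    return [i for i, score in enumerate(scores) if score <= threshold]
```

For example, `low_complexity_positions("AAAAACDE", window_length=4, threshold=1.0)` flags the windows starting at positions 0, 1 and 2, which are dominated by a single residue type. Passing `measure=enumeration_complexity` requires a threshold on that measure's own scale (0 to at most 1).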