J. C. Wootton, Non-globular domains in protein sequences: automated segmentation using complexity measures, Computers & Chemistry, 18(3), 1994, pp. 269-285
Computational methods based on mathematically-defined measures of
compositional complexity have been developed to distinguish globular and
non-globular regions of protein sequences. Compact globular structures in
protein molecules are shown to be determined by amino acid sequences of high
informational complexity. Sequences of known crystal structure in the
Brookhaven Protein Data Bank differ only slightly from randomly shuffled
sequences in the distribution of statistical properties such as local
compositional complexity. In contrast, in the much larger body of deduced
sequences in the SWISS-PROT database, approximately one quarter of the
residues occur in segments of non-randomly low complexity and approximately
half of the entries contain at least one such segment. Sequences of proteins
with known, physicochemically-defined non-globular regions have been
analyzed, including collagens, different classes of coiled-coil proteins,
elastins, histones, non-histone proteins, mucins, proteoglycan core proteins
and proteins containing long single solvent-exposed alpha-helices. The SEG
algorithm provides an effective general method for partitioning the globular
and non-globular regions of these sequences fully automatically. This method
is also facilitating the discovery of new classes of long, non-globular
sequence segments, as illustrated by the example of the human CAN gene
product involved in tumor induction.
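
The idea of local compositional complexity can be illustrated with a minimal sketch. This is not the SEG algorithm itself; it is a simplified stand-in that scores each fixed-length window by the Shannon entropy of its residue composition and merges low-scoring windows into segments. The function names, the window length of 12 and the 2.2-bit threshold are illustrative assumptions, not values taken from the abstract.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def window_complexity(seq: str, window: int = 12) -> list:
    """Entropy of every full-length window, indexed by window start position."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]


def low_complexity_segments(seq: str, window: int = 12, threshold: float = 2.2) -> list:
    """Merge windows with entropy <= threshold into (start, end) spans.

    Spans are 0-based with an exclusive end. Both parameters are
    hypothetical choices made for this illustration only.
    """
    segments = []
    for i, h in enumerate(window_complexity(seq, window)):
        if h > threshold:
            continue  # window is compositionally diverse; skip it
        start, end = i, i + window
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous span: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


if __name__ == "__main__":
    # Made-up sequence: a poly-Q run flanked by compositionally diverse residues.
    demo = "ACDEFGHIKLMN" + "Q" * 15 + "ACDEFGHIKLMN"
    for start, end in low_complexity_segments(demo):
        print(start, end, demo[start:end])
```

A single-residue run scores 0 bits, while a window of 12 distinct residues scores log2(12) bits, so the threshold separates homopolymeric or strongly biased stretches from diverse ones. A fixed threshold on raw entropy is a deliberate simplification; the reported segments include some flanking residues because whole windows are merged.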