STATISTICS OF LOCAL COMPLEXITY IN AMINO-ACID-SEQUENCES AND SEQUENCE DATABASES

Citation
J.C. Wootton and S. Federhen, STATISTICS OF LOCAL COMPLEXITY IN AMINO-ACID-SEQUENCES AND SEQUENCE DATABASES, Computers & Chemistry, 17(2), 1993, pp. 149-163
Number of citations
28
Subject categories
Computer Application, Chemistry & Engineering; Computer Applications & Cybernetics; Chemistry
Journal title
Computers & Chemistry
Journal ISSN
00978485
Volume
17
Issue
2
Year of publication
1993
Pages
149 - 163
Database
ISI
SICI code
0097-8485(1993)17:2<149:SOLCIA>2.0.ZU;2-Q
Abstract
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are: (1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the chi-squared statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
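
Illustrative sketch
As a rough illustration of two of the measures named in the abstract, the Python sketch below computes an entropy-style complexity (definition 2) and a multinomial probability of the observed composition (definition 3) for a fixed-width window, then flags low-entropy windows in a first-pass scan. It is not the SEG algorithm and is not taken from the paper: the function names, the assumption of 20 equiprobable residues, the window width of 12 and the threshold of 2.2 bits are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy, in bits, of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def log_multinomial_probability(window: str, alphabet_size: int = 20) -> float:
    """Natural log of the multinomial probability of the window's composition.

    Assumption for this sketch: all residues are equiprobable (1/alphabet_size).
    """
    counts = Counter(window)
    n = len(window)
    # log of the multinomial coefficient n! / (n_1! n_2! ...)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_coeff - n * math.log(alphabet_size)


def low_complexity_windows(seq: str, width: int = 12, threshold: float = 2.2):
    """Return (start, entropy) for every window whose entropy is <= threshold.

    width and threshold are illustrative values, not parameters from the source.
    """
    hits = []
    for start in range(len(seq) - width + 1):
        h = window_entropy(seq[start:start + width])
        if h <= threshold:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence: a mixed region followed by a glutamine-rich run.
    example = "MKTAYIAKQRDEGHLWSQQQQQQQQQQQQPQQQQ"
    for start, h in low_complexity_windows(example):
        print(start, round(h, 3))
```

A homopolymer window has entropy 0 and a comparatively low composition probability, while a window of all-distinct residues has the maximum entropy for its width, so both measures rank windows in a broadly similar way for a first-pass scan of this kind.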