P. Bernaolagalvan et al., COMPOSITIONAL SEGMENTATION AND LONG-RANGE FRACTAL CORRELATIONS IN DNA-SEQUENCES, Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics, 53(5), 1996, pp. 5181-5189
A segmentation algorithm based on the Jensen-Shannon entropic divergence is used to decompose long-range correlated DNA sequences into statistically significant, compositionally homogeneous patches. By adequately setting the significance level for segmenting the sequence, the underlying power-law distribution of patch lengths can be revealed. Some of the identified DNA domains were uncorrelated, but most of them continued to display long-range correlations even after several steps of recursive segmentation, thus indicating a complex multi-length-scaled structure for the sequence. On the other hand, by separately shuffling each segment, or by randomly rearranging the order in which the different segments occur in the sequence, shuffled sequences preserving the original statistical distribution of patch lengths were generated. Both types of random sequences displayed the same correlation scaling exponents as the original DNA sequence, thus demonstrating that neither the internal structure of patches nor the order in which these are arranged in the sequence is critical; therefore, long-range correlations in nucleotide sequences seem to rely only on the power-law distribution of patch lengths.
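
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard length-weighted Jensen-Shannon divergence between the nucleotide compositions on either side of a candidate cut, and it replaces the paper's statistical significance level with a plain divergence threshold (the `threshold` parameter is a hypothetical stand-in, as are all function names). The two helper functions at the end build the two kinds of shuffled sequences mentioned above, both of which keep the set of patch lengths unchanged.

```python
import math
import random
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (in bits) of a composition given as symbol counts."""
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def js_divergence(seq, cut):
    """Length-weighted Jensen-Shannon divergence between seq[:cut] and seq[cut:]."""
    n = len(seq)
    left, right = Counter(seq[:cut]), Counter(seq[cut:])
    whole = left + right
    return (shannon_entropy(whole, n)
            - (cut / n) * shannon_entropy(left, cut)
            - ((n - cut) / n) * shannon_entropy(right, n - cut))


def best_cut(seq):
    """Return (position, divergence) of the cut that maximizes the divergence.

    Brute force, quadratic in len(seq); adequate only for short examples.
    """
    return max(((i, js_divergence(seq, i)) for i in range(1, len(seq))),
               key=lambda pair: pair[1])


def segment(seq, threshold, start=0):
    """Recursively split seq; return the sorted list of absolute cut positions.

    ASSUMPTION: a raw divergence threshold stands in for the significance
    level used in the paper; it is not a statistical test.
    """
    if len(seq) < 2:
        return []
    cut, divergence = best_cut(seq)
    if divergence < threshold:
        return []
    return (segment(seq[:cut], threshold, start)
            + [start + cut]
            + segment(seq[cut:], threshold, start + cut))


def patches(seq, cuts):
    """Split seq into patches at the given cut positions."""
    bounds = [0] + list(cuts) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def shuffle_within(seq, cuts, rng):
    """Shuffle the symbols inside each patch, keeping the patch order."""
    out = []
    for patch in patches(seq, cuts):
        symbols = list(patch)
        rng.shuffle(symbols)
        out.append("".join(symbols))
    return "".join(out)


def shuffle_order(seq, cuts, rng):
    """Rearrange the order of the patches, keeping each patch intact."""
    parts = patches(seq, cuts)
    rng.shuffle(parts)
    return "".join(parts)


if __name__ == "__main__":
    toy = "A" * 50 + "G" * 50          # two perfectly homogeneous patches
    cuts = segment(toy, threshold=0.5)
    print(cuts)                         # boundary between the two patches
    rng = random.Random(0)
    print(shuffle_order(toy, cuts, rng))
```

Because the raw divergence of a random sequence depends on its length, a fixed threshold like the one above is only a convenience for the toy example; reproducing the paper's results would require the significance criterion the authors actually use.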