INTERPRETING CORRELATIONS IN BIOSEQUENCES

Citation
H. Herzel et al., INTERPRETING CORRELATIONS IN BIOSEQUENCES, Physica A, 249(1-4), 1998, pp. 449-459
Number of citations
40
Subject categories
Physics
Journal title
Physica A
Journal ISSN
0378-4371
Volume
249
Issue
1-4
Year of publication
1998
Pages
449 - 459
Database
ISI
SICI code
0378-4371(1998)249:1-4<449:ICIB>2.0.ZU;2-W
Abstract
Understanding the complex organization of genomes as well as predicting the location of genes and the possible structure of the gene products are some of the most important problems in current molecular biology. Many statistical techniques are used to address these issues, and correlation functions play a central role among them. This paper is based on an analysis of the decay of the entire 4 x 4 dimensional covariance matrix of DNA sequences. We apply this covariance analysis to human chromosomal regions, yeast DNA, and bacterial genomes and interpret the three most pronounced statistical features - long-range correlations, a period of 3, and a period of 10-11 - using known biological facts about the structure of genomes. For example, we relate the slowly decaying long-range G+C correlations to dispersed repeats and CpG islands. We show quantitatively that the 3-basepair periodicity is due to the nonuniformity of the codon usage in protein coding segments. We finally show that periodicities of 10-11 basepairs in yeast DNA originate from an alternation of hydrophobic and hydrophilic amino acids in protein sequences. (C) 1998 Elsevier Science B.V. All rights reserved.
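
The covariance analysis named in the abstract can be illustrated with a minimal sketch. It assumes the common definition C_ab(k) = P_ab(k) - p_a p_b, where P_ab(k) is the frequency of finding base a followed by base b at distance k, and p_a, p_b are the single-base frequencies. The abstract does not state the estimator the authors used, so this definition, the function name covariance_matrix, and the constant ALPHABET are assumptions made for illustration only.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed row/column order of the 4 x 4 matrix


def covariance_matrix(seq, k):
    """Estimate C[a][b] = P_ab(k) - p_a * p_b for bases k positions apart.

    Hypothetical helper, not taken from the paper. The single-base
    frequencies are estimated from the same positions that form the pairs,
    so the entries sum to zero when the sequence contains only A, C, G, T.
    """
    seq = seq.upper()
    n_pairs = len(seq) - k
    if k < 1 or n_pairs < 1:
        raise ValueError("k must satisfy 1 <= k < len(seq)")

    # Joint counts of (base at i, base at i + k).
    pairs = Counter(zip(seq[:n_pairs], seq[k:]))
    # Marginal counts of the first and second member of each pair.
    first = Counter(seq[:n_pairs])
    second = Counter(seq[k:])

    return [
        [
            pairs[(a, b)] / n_pairs
            - (first[a] / n_pairs) * (second[b] / n_pairs)
            for b in ALPHABET
        ]
        for a in ALPHABET
    ]


if __name__ == "__main__":
    # Toy sequence with an exact period of 3, standing in for codon structure.
    toy = "ACG" * 30
    for k in range(1, 7):
        c = covariance_matrix(toy, k)
        print(k, round(c[0][0], 4))  # the A-A entry as a function of distance
```

In this toy sequence the A-A entry is positive at distances that are multiples of 3 and negative otherwise, which is the kind of signature the abstract attributes to nonuniform codon usage. Real sequences are noisy, so the decay of each of the 16 entries would be examined over many distances rather than read off a single value.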