Probabilistic and statistical properties of words: An overview

Authors

Reinert, G; Schbath, S; Waterman, MS

Citation

G. Reinert et al., Probabilistic and statistical properties of words: An overview, J COMPUT BI, 7(1-2), 2000, pp. 1-46

Number of citations

Subject categories

Biochemistry & Biophysics

Journal title

JOURNAL OF COMPUTATIONAL BIOLOGY

ISSN journal

1066-5277

Volume

7

Issue

1-2

Year of publication

2000

Pages

1 - 46

Database

ISI

SICI code

1066-5277(200002/04)7:1-2<1:PASPOW>2.0.ZU;2-X

Abstract

In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
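
The three kinds of counts named in the abstract can be illustrated on a toy sequence. The sketch below is not taken from the paper; it assumes the usual definitions (an occurrence is any position where the word matches, overlaps allowed; a clump is a maximal run of mutually overlapping occurrences; a renewal is an occurrence that does not overlap the previously counted renewal). The function names and the example sequence are hypothetical.

```python
def occurrences(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word in seq."""
    m = len(word)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == word]


def count_clumps(positions, m):
    """Number of clumps: an occurrence opens a new clump when it does not
    overlap the occurrence immediately before it (gap of at least m)."""
    clumps = 0
    prev = None
    for p in positions:
        if prev is None or p - prev >= m:
            clumps += 1
        prev = p
    return clumps


def count_renewals(positions, m):
    """Number of renewals: scan left to right and count an occurrence only
    if it starts at or after the end of the last counted occurrence."""
    renewals = 0
    next_free = 0
    for p in positions:
        if p >= next_free:
            renewals += 1
            next_free = p + m
    return renewals


# Assumed toy data: the self-overlapping word "ATA" in a short DNA-like string.
seq = "ATATATAGGATA"
word = "ATA"
pos = occurrences(seq, word)
n_occ = len(pos)
n_clumps = count_clumps(pos, len(word))
n_renewals = count_renewals(pos, len(word))
print(pos, n_occ, n_clumps, n_renewals)
```

For a self-overlapping word such as "ATA" the three counts generally differ, which is the dependence structure the abstract refers to; for a word that cannot overlap itself they coincide.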