Exact distribution of word counts in shuffled sequences

Citation
Rødland, Einar Andreas, Exact distribution of word counts in shuffled sequences, Advances in applied probability, 38(1), 2006, pp. 116-133
ISSN journal
00018678
Volume
38
Issue
1
Year of publication
2006
Pages
116 - 133
Database
ACNP
SICI code
Abstract
In DNA sequences, specific words may take on biological functions as marker or signalling sequences. These words can often be identified by frequent-word analyses because they are particularly abundant. Accurate statistics are needed to assess the statistical significance of these word frequencies. The set of shuffled sequences, that is, letter sequences having the same k-word composition as the sequence being analysed for some choice of k, is considered the most appropriate sample space for analysing word counts. However, little is known about the distribution of these word counts. Here we present exact formulae for word counts in shuffled sequences.
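
To make the sample space in the abstract concrete, the following minimal Python sketch enumerates shuffled sequences by brute force and tabulates a word count over them. It is an illustration only and does not implement the paper's exact formulae, which this record does not reproduce. The function names (kword_counts, shuffled_sequences, word_count_distribution) are hypothetical, and the sketch assumes that a shuffled sequence is a rearrangement of the original letters with identical overlapping k-word counts, and that all such sequences are equally likely.

```python
from collections import Counter
from itertools import permutations


def kword_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffled_sequences(seq, k):
    """Brute-force the rearrangements of seq with the same k-word counts.

    Assumption: 'shuffled' means a permutation of the letters of seq whose
    overlapping k-word composition equals that of seq. Feasible only for
    very short sequences, since all permutations are enumerated.
    """
    target = kword_counts(seq, k)
    candidates = {"".join(p) for p in permutations(seq)}
    return sorted(s for s in candidates if kword_counts(s, k) == target)


def word_count_distribution(seq, k, word):
    """Distribution of the overlapping count of word over the shuffled set.

    Assumption: every distinct shuffled sequence has equal probability.
    """
    shuffles = shuffled_sequences(seq, k)
    tally = Counter(kword_counts(s, len(word))[word] for s in shuffles)
    return {count: n / len(shuffles) for count, n in sorted(tally.items())}
```

For example, with k = 1 every rearrangement of "AAB" qualifies (AAB, ABA, BAA), so the word "AA" occurs once in two of the three sequences and not at all in the third. With k = 2 the sequence "ABAAB" has only two shuffled sequences under the assumption above, "AABAB" and "ABAAB". Enumeration grows factorially with sequence length, which is why closed-form results such as those announced in the abstract matter for realistic inputs.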