Approximate word matches between two random sequences

Citation
Burden, Conrad J. et al., Approximate word matches between two random sequences, Annals of Applied Probability, 18(1), 2008, pp. 1-21
ISSN journal
1050-5164
Volume
18
Issue
1
Year of publication
2008
Pages
1 - 21
Database
ACNP
Abstract
Given two sequences over a finite alphabet L, the D2 statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D2 statistic in the context of DNA sequences, under the assumption of strand-symmetric Bernoulli text. For k < m, we look at the count of m-letter word matches with up to k mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
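
Illustration (not part of the original record): the statistic described in the abstract can be sketched in Python as below. This is a brute-force sketch rather than the authors' method. The function names `approx_word_matches` and `expected_matches` are hypothetical, and the expectation assumes independent, identically distributed letters (Bernoulli text), so that the number of mismatches between two aligned m-letter words is binomial.

```python
from math import comb


def approx_word_matches(a: str, b: str, m: int, k: int = 0) -> int:
    """Count pairs (i, j) whose m-letter words a[i:i+m] and b[j:j+m]
    differ in at most k positions. With k = 0 this is the D2 statistic."""
    count = 0
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            mismatches = sum(x != y for x, y in zip(a[i:i + m], b[j:j + m]))
            if mismatches <= k:
                count += 1
    return count


def expected_matches(n1: int, n2: int, m: int, k: int, probs: dict) -> float:
    """Expected count for sequences of lengths n1 and n2, assuming i.i.d.
    letters drawn with the probabilities in `probs` (an assumption of
    this sketch)."""
    # Probability that two independent letters coincide.
    p_match = sum(p * p for p in probs.values())
    # Probability that two aligned m-words have at most k mismatches.
    p_word = sum(
        comb(m, t) * (1 - p_match) ** t * p_match ** (m - t)
        for t in range(k + 1)
    )
    return (n1 - m + 1) * (n2 - m + 1) * p_word


if __name__ == "__main__":
    # Hypothetical strand-symmetric letter distribution: P(A)=P(T), P(C)=P(G).
    probs = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
    print(approx_word_matches("ACGT", "ACGA", m=2, k=0))  # 2
    print(approx_word_matches("ACGT", "ACGA", m=2, k=1))  # 3
    print(expected_matches(100, 100, m=8, k=1, probs=probs))
```

The double loop takes time proportional to the product of the sequence lengths times m, which is adequate for illustration but not for database-scale searches.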