Given a text string of length n and a pattern string of length m over
a b-letter alphabet, the k differences approximate string matching pro
blem asks for all locations in the text where the pattern occurs with
at most k differences (substitutions, insertions, deletions). We treat
k not as a constant but as a fraction of m (not necessarily constant-
fraction). Previous algorithms require at least 0(kn) time (or exponen
tial space). We give an algorithm that is sublinear time O((n/m)k log(
b) m) when the text is random and k is bounded by the threshold m/(log
(b) m + 0(1)). In particular, when k = o(m/log(b) m) the expected runn
ing time is o(n). In the worst case our algorithm is 0(kn), but is sti
ll an improvement in that it is practical and uses 0(m) space compared
with 0(n) or 0(m2). We define three problems motivated by molecular b
iology and describe efficient algorithms based on our techniques: (1)
approximate substring matching, (2) approximate-overlap detection, and
(3) approximate codon matching. Respectively, applications to biology
are local similarity search, sequence assembly, and DNA-protein match
ing.