SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS

Citation
Wi. Chang et El. Lawler, SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS, Algorithmica, 12(4-5), 1994, pp. 327-344
Citations number
51
Categorie Soggetti
Computer Sciences",Mathematics,Mathematics,"Computer Science Software Graphycs Programming
Journal title
ISSN journal
01784617
Volume
12
Issue
4-5
Year of publication
1994
Pages
327 - 344
Database
ISI
SICI code
0178-4617(1994)12:4-5<327:SASABA>2.0.ZU;2-T
Abstract
Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k differences approximate string matching pro blem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily constant- fraction). Previous algorithms require at least 0(kn) time (or exponen tial space). We give an algorithm that is sublinear time O((n/m)k log( b) m) when the text is random and k is bounded by the threshold m/(log (b) m + 0(1)). In particular, when k = o(m/log(b) m) the expected runn ing time is o(n). In the worst case our algorithm is 0(kn), but is sti ll an improvement in that it is practical and uses 0(m) space compared with 0(n) or 0(m2). We define three problems motivated by molecular b iology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein match ing.