An algorithm for approximate tandem repeats

Citation
Gm. Landau et al., An algorithm for approximate tandem repeats, J COMPUT BI, 8(1), 2001, pp. 1-18
Citations number
26
Categorie Soggetti
Biochemistry & Biophysics
Journal title
JOURNAL OF COMPUTATIONAL BIOLOGY
ISSN journal
10665277 → ACNP
Volume
8
Issue
1
Year of publication
2001
Pages
1 - 18
Database
ISI
SICI code
1066-5277(2001)8:1<1:AAFATR>2.0.ZU;2-B
Abstract
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e,g,, abcabc, An approximate single tandem repeat is one in which the substrings are similar, but not identical , e,g,, abcdaacd, In this paper we consider two criterions of similarity: t he Hamming distance (k mismatches) and the edit distance (k differences). F or a string S of length n and an integer k our algorithm reports all locall y optimal approximate repeats, r = (u) over bar(u) over cap, for which the Hamming distance of (u) over bar and (u) over cap is at most k, in O(nk log (n / k)) time, or all those for which the edit distance of (u) over bar and (u) over cap is at most k, in O(nk log k log(n / k)) time. This paper conc entrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u(a)u ', where u is a prefix of r and u ' is a prefix of u, A n approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical, We precisely define ap proximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, whe n searching for repeats with up to k errors in a string S of length a, is O (nka log(n / k)) where a is the maximum number of periods in any reported r epeat. We present some experimental results concerning the performance and sensitivity of our algorithm, The problem of finding repeats within a strin g is a computational problem with important applications in the field of mo lecular biology. Both exact and inexact repeats occur frequently in the gen ome, and certain repeats occurring in the genome are known to be related to diseases in the human.