Euler circuits and DNA sequencing by hybridization

Citation
R. Arratia et al., Euler circuits and DNA sequencing by hybridization, DISCR APP M, 104(1-3), 2000, pp. 63-96
Citations number
31
Categorie Soggetti
Engineering Mathematics
Volume
104
Issue
1-3
Year of publication
2000
Pages
63 - 96
Database
ISI
SICI code
Abstract
Sequencing by hybridization is a method of reconstructing a long DNA string - that is, figuring out its nucleotide sequence - from knowledge of its sh ort substrings. Unique reconstruction is not always possible, and the goal of this paper is to study the number of reconstructions of a random string. For a given string, the number of reconstructions is determined by the pat tern of repeated substrings; in an appropriate limit substrings will occur at most twice, so the pattern of repeats is given by a pairing: a string of length 2n in which each symbol occurs twice. A pairing induces a 2-in, 2-o ut graph, whose directed edges are defined by successive symbols of the pai ring - for example the pairing ABBCAC induces the graph with edges AB, BE, BC, and so forth - and the number of reconstructions is simply the number o f Euler circuits in this 2-in, 2-out graph. The original problem is thus tr ansformed into one about pairings: to find the number f(k)(n) of n-symbol p airings having k Euler circuits. We show how to compute this function, in closed form, for any fixed k, and we present the functions explicitly for k = 1,...,9. The key is a decomposi tion theorem: the Euler "circuit number" of a pairing is the product of the circuit numbers of "component" sub-pairings. These components come from co nnected components of the "interlace graph", which has the: pairing's symbo ls as vertices, and edges when symbols are "interlaced". (A and B are inter laced if the pairing has the form A B A B or B A B A.) We carry these resul ts back to the original question about DNA strings, and provide a total var iation distance upper bound for the approximation error. We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that, for a typical random n-pairing, the number of Euler circuits is of order no smaller than 2(n)/n, and the expected number is asymptotically at least e( -1/2)2(n-1)/n. Since any n-pairing has at most 2(n-1) Euler circuits, this pinpoints the exponential growth rate. (C) 2000 Elsevier Science B.V. All r ights reserved.