Sequencing by hybridization is a method of reconstructing a long DNA string
- that is, figuring out its nucleotide sequence - from knowledge of its sh
ort substrings. Unique reconstruction is not always possible, and the goal
of this paper is to study the number of reconstructions of a random string.
For a given string, the number of reconstructions is determined by the pat
tern of repeated substrings; in an appropriate limit substrings will occur
at most twice, so the pattern of repeats is given by a pairing: a string of
length 2n in which each symbol occurs twice. A pairing induces a 2-in, 2-o
ut graph, whose directed edges are defined by successive symbols of the pai
ring - for example the pairing ABBCAC induces the graph with edges AB, BE,
BC, and so forth - and the number of reconstructions is simply the number o
f Euler circuits in this 2-in, 2-out graph. The original problem is thus tr
ansformed into one about pairings: to find the number f(k)(n) of n-symbol p
airings having k Euler circuits.
We show how to compute this function, in closed form, for any fixed k, and
we present the functions explicitly for k = 1,...,9. The key is a decomposi
tion theorem: the Euler "circuit number" of a pairing is the product of the
circuit numbers of "component" sub-pairings. These components come from co
nnected components of the "interlace graph", which has the: pairing's symbo
ls as vertices, and edges when symbols are "interlaced". (A and B are inter
laced if the pairing has the form A B A B or B A B A.) We carry these resul
ts back to the original question about DNA strings, and provide a total var
iation distance upper bound for the approximation error.
We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that,
for a typical random n-pairing, the number of Euler circuits is of order no
smaller than 2(n)/n, and the expected number is asymptotically at least e(
-1/2)2(n-1)/n. Since any n-pairing has at most 2(n-1) Euler circuits, this
pinpoints the exponential growth rate. (C) 2000 Elsevier Science B.V. All r
ights reserved.