In laboratories the majority of large-scale DNA sequencing is done fol
lowing the shotgun strategy, which is to sequence large amount of rela
tively short fragments randomly and then heuristically find a shortest
common superstring of the fragments [26]. We study mathematical frame
works, under plausible assumptions, suitable for massive automated DNA
sequencing and for analyzing DNA sequencing algorithms. We model the
DNA sequencing problem as learning a string from its randomly drawn su
bstrings. Under certain restrictions, this may be viewed as string lea
rning in Valiant's distribution-free learning model and in this case w
e give an efficient learning algorithm and a quantitative bound on how
many examples suffice. One major obstacle to our approach turns out t
o be a quite well-known open question on how to approximate a shortest
common superstring of a set of strings, raised by a number of authors
in the last 10 years [9], [29], [30]. We give the first provably good
algorithm which approximates a shortest superstring of length n by a
superstring of length O (n log n). The algorithm works equally well ev
en in the presence of negative examples, i.e., when merging of some st
rings is prohibited.