DNA-SEQUENCING AND STRING LEARNING

Authors
Citation
T. Jiang et M. Li, DNA-SEQUENCING AND STRING LEARNING, Mathematical systems theory, 29(4), 1996, pp. 387-405
Citations number
33
Categorie Soggetti
System Science","Mathematics, Pure","Computer Science Theory & Methods",Mathematics
Journal title
ISSN journal
00255661
Volume
29
Issue
4
Year of publication
1996
Pages
387 - 405
Database
ISI
SICI code
0025-5661(1996)29:4<387:DASL>2.0.ZU;2-P
Abstract
In laboratories the majority of large-scale DNA sequencing is done fol lowing the shotgun strategy, which is to sequence large amount of rela tively short fragments randomly and then heuristically find a shortest common superstring of the fragments [26]. We study mathematical frame works, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn su bstrings. Under certain restrictions, this may be viewed as string lea rning in Valiant's distribution-free learning model and in this case w e give an efficient learning algorithm and a quantitative bound on how many examples suffice. One major obstacle to our approach turns out t o be a quite well-known open question on how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last 10 years [9], [29], [30]. We give the first provably good algorithm which approximates a shortest superstring of length n by a superstring of length O (n log n). The algorithm works equally well ev en in the presence of negative examples, i.e., when merging of some st rings is prohibited.