ITA
ENG

DNA-SEQUENCING AND STRING LEARNING

Authors

JIANG T LI M

Citation

T. Jiang et M. Li, DNA-SEQUENCING AND STRING LEARNING, Mathematical systems theory, 29(4), 1996, pp. 387-405

Citations number

Categorie Soggetti

System Science","Mathematics, Pure","Computer Science Theory & Methods",Mathematics

Journal title

Mathematical systems theory → ACNP

ISSN journal

00255661

Volume

Issue

Year of publication

1996

Pages

387 - 405

Database

ISI

SICI code

0025-5661(1996)29:4<387:DASL>2.0.ZU;2-P

Abstract

In laboratories the majority of large-scale DNA sequencing is done fol lowing the shotgun strategy, which is to sequence large amount of rela tively short fragments randomly and then heuristically find a shortest common superstring of the fragments [26]. We study mathematical frame works, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn su bstrings. Under certain restrictions, this may be viewed as string lea rning in Valiant's distribution-free learning model and in this case w e give an efficient learning algorithm and a quantitative bound on how many examples suffice. One major obstacle to our approach turns out t o be a quite well-known open question on how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last 10 years [9], [29], [30]. We give the first provably good algorithm which approximates a shortest superstring of length n by a superstring of length O (n log n). The algorithm works equally well ev en in the presence of negative examples, i.e., when merging of some st rings is prohibited.