ITA
ENG

CORRECTING SEQUENCING ERRORS IN DNA CODING REGIONS USING A DYNAMIC-PROGRAMMING APPROACH

Authors

XU Y MURAL RJ UBERBACHER EC

Citation

Y. Xu et al., CORRECTING SEQUENCING ERRORS IN DNA CODING REGIONS USING A DYNAMIC-PROGRAMMING APPROACH, Computer applications in the biosciences, 11(2), 1995, pp. 117-124

Citations number

Categorie Soggetti

Mathematical Methods, Biology & Medicine","Computer Sciences, Special Topics","Computer Science Interdisciplinary Applications","Biology Miscellaneous

Journal title

Computer applications in the biosciences → ACNP

ISSN journal

02667061

Volume

Issue

Year of publication

1995

Pages

117 - 124

Database

ISI

SICI code

0266-7061(1995)11:2<117:CSEIDC>2.0.ZU;2-9

Abstract

This paper presents an algorithm for detecting and 'correcting' sequen cing errors that occur in DNA coding legions. The types of sequencing errors addressed are insertions and deletions (indels) of DNA bases. T he goal is to provide a capability which makes single-pass or low-redu ndancy sequence data more informative, reducing the need for high-redu ndancy sequencing for gene identification and characterization purpose s. This would permit improved sequencing efficiency and reduce genome sequencing casts. The algorithm detects sequencing errors by discoveri ng changes in the statistically preferred reading frame within a putat ive coding region and then inserts a number of 'neutral' bases at a pe rceived reading frame transition point to make the putative exon candi date frame consistent. We have implemented the algorithm as a front-en d subsystem of the GRAIL DNA sequence analysis system to construct a v ersion which is very error tolerant and also intend to use this as a t estbed for further development of sequencing error-correction technolo gy. Preliminary test results have shown the usefulness of this algorit hm and also exhibited some of its weakness, providing possible directi ons for further improvement. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the alg orithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases, With this subsystem in place, GRAIL correctly predicted 89% of the cod ing messages with 10% false message on the 'corrected' sequences, comp ared to 69% correctly predicted coding messages and 11% falsely predic ted messages on the 'corrupted' sequences using standard GRAIL II meth od (version 1.2). The method uses a dynamic programming algorithm, and runs in time and space linear to the size of the input sequence.