CORRECTING SEQUENCING ERRORS IN DNA CODING REGIONS USING A DYNAMIC-PROGRAMMING APPROACH

Citation
Y. Xu et al., CORRECTING SEQUENCING ERRORS IN DNA CODING REGIONS USING A DYNAMIC-PROGRAMMING APPROACH, Computer applications in the biosciences, 11(2), 1995, pp. 117-124
Citations number
9
Categorie Soggetti
Mathematical Methods, Biology & Medicine","Computer Sciences, Special Topics","Computer Science Interdisciplinary Applications","Biology Miscellaneous
ISSN journal
02667061
Volume
11
Issue
2
Year of publication
1995
Pages
117 - 124
Database
ISI
SICI code
0266-7061(1995)11:2<117:CSEIDC>2.0.ZU;2-9
Abstract
This paper presents an algorithm for detecting and 'correcting' sequen cing errors that occur in DNA coding legions. The types of sequencing errors addressed are insertions and deletions (indels) of DNA bases. T he goal is to provide a capability which makes single-pass or low-redu ndancy sequence data more informative, reducing the need for high-redu ndancy sequencing for gene identification and characterization purpose s. This would permit improved sequencing efficiency and reduce genome sequencing casts. The algorithm detects sequencing errors by discoveri ng changes in the statistically preferred reading frame within a putat ive coding region and then inserts a number of 'neutral' bases at a pe rceived reading frame transition point to make the putative exon candi date frame consistent. We have implemented the algorithm as a front-en d subsystem of the GRAIL DNA sequence analysis system to construct a v ersion which is very error tolerant and also intend to use this as a t estbed for further development of sequencing error-correction technolo gy. Preliminary test results have shown the usefulness of this algorit hm and also exhibited some of its weakness, providing possible directi ons for further improvement. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the alg orithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases, With this subsystem in place, GRAIL correctly predicted 89% of the cod ing messages with 10% false message on the 'corrected' sequences, comp ared to 69% correctly predicted coding messages and 11% falsely predic ted messages on the 'corrupted' sequences using standard GRAIL II meth od (version 1.2). The method uses a dynamic programming algorithm, and runs in time and space linear to the size of the input sequence.