Y. Xu et al., CORRECTING SEQUENCING ERRORS IN DNA CODING REGIONS USING A DYNAMIC-PROGRAMMING APPROACH, Computer applications in the biosciences, 11(2), 1995, pp. 117-124
This paper presents an algorithm for detecting and 'correcting' sequen
cing errors that occur in DNA coding legions. The types of sequencing
errors addressed are insertions and deletions (indels) of DNA bases. T
he goal is to provide a capability which makes single-pass or low-redu
ndancy sequence data more informative, reducing the need for high-redu
ndancy sequencing for gene identification and characterization purpose
s. This would permit improved sequencing efficiency and reduce genome
sequencing casts. The algorithm detects sequencing errors by discoveri
ng changes in the statistically preferred reading frame within a putat
ive coding region and then inserts a number of 'neutral' bases at a pe
rceived reading frame transition point to make the putative exon candi
date frame consistent. We have implemented the algorithm as a front-en
d subsystem of the GRAIL DNA sequence analysis system to construct a v
ersion which is very error tolerant and also intend to use this as a t
estbed for further development of sequencing error-correction technolo
gy. Preliminary test results have shown the usefulness of this algorit
hm and also exhibited some of its weakness, providing possible directi
ons for further improvement. On a test set consisting of 68 human DNA
sequences with 1% randomly generated indels in coding regions, the alg
orithm detected and corrected 76% of the indels. The average distance
between the position of an indel and the predicted one was 9.4 bases,
With this subsystem in place, GRAIL correctly predicted 89% of the cod
ing messages with 10% false message on the 'corrected' sequences, comp
ared to 69% correctly predicted coding messages and 11% falsely predic
ted messages on the 'corrupted' sequences using standard GRAIL II meth
od (version 1.2). The method uses a dynamic programming algorithm, and
runs in time and space linear to the size of the input sequence.