Xj. Guan et Ec. Uberbacher, ALIGNMENTS OF DNA AND PROTEIN SEQUENCES CONTAINING FRAMESHIFT ERRORS, Computer applications in the biosciences, 12(1), 1996, pp. 31-40
Molecular sequences, like all experimental data, are subject to error.
Many current DNA sequencing protocols have very significant error rat
es and often generate artefactual insertions and deletions of bases (i
ndels) which corrupt the translation of sequences and compromise the d
etection of protein homologies. The impact of these errors on the util
ity of molecular sequence data is dependent on the analytic technique
used to interpret the data. In the presence of frameshift errors, stan
dard algorithms using six-frame translation can miss important homolog
ies because only subfragments of the correct translation are available
in any given frame. We present a new algorithm which can detect and c
orrect frameshift errors in DNA sequences during comparison of transla
ted sequences with protein sequences in the databases. This algorithm
can recognize homologous proteins sharing 30% identity even in the pre
sence of a 7% frameshift error rate. Our algorithm uses dynamic progra
mming, producing a guaranteed optimal alignment in the presence of fra
meshifts, and has a sensitivity equivalent to Smith-Waterman. The comp
utational efficiency of the algorithm is O(nm) where n and m are the s
izes of two sequences being compared. The algorithm does not rely on p
rior knowledge or heurisitic rules and performs significantly better t
han any previously reported method.