S. Dutt and F.T. Assaad, MANTISSA-PRESERVING OPERATIONS AND ROBUST ALGORITHM-BASED FAULT-TOLERANCE FOR MATRIX COMPUTATIONS, IEEE Transactions on Computers, 45(4), 1996, pp. 408-424
A system-level method for achieving fault tolerance called algorithm-based fault tolerance (ABFT) has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding of the equality test has been commonly used to avoid false alarms; however, a good threshold that minimizes false alarms without reducing the error coverage significantly is difficult to find, especially when not much is known about the input data. Furthermore, thresholded checksums will inevitably miss lower-bit errors, which can get magnified as a computation such as LU decomposition progresses. Here we develop a theory for applying integer mantissa checksum tests to "mantissa-preserving" floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest of the computation. We apply this general methodology to matrix-matrix multiplication and LU decomposition (using the Gaussian elimination (GE) algorithm), and find that the accuracy of this new "hybrid" testing scheme is substantially higher than the floating-point test with thresholding, and also that its time overhead with respect to the floating-point test is nominal (15% and 9.5% on the average for matrix multiplication and LU decomposition, respectively). The hybrid test can also be easily applied to other computations like matrix inversion that use the GE algorithm. We prove that the mantissa-based integer checksum test for both matrix multiplication and LU decomposition is able to detect at least three errors in the floating-point multiplication component of these computations. For LU decomposition, it is also able to correct a single error in the floating-point multiplies.
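
The weakness of the thresholded floating-point checksum test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not show their mantissa-based test; it only shows the conventional column-checksum test for C = A * B, in which the column sums of A multiplied by B are compared with the column sums of C. The helper names (matmul, column_sums, checksum_test), the matrix size, and the threshold value 1e-9 are all assumptions chosen for illustration.

```python
import math
import random

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_sums(m):
    """Checksum row: the sum of each column of m."""
    return [sum(col) for col in zip(*m)]

def checksum_test(a, b, c, threshold):
    """Return True if c passes the thresholded checksum test for c = a * b."""
    expected = matmul([column_sums(a)], b)[0]  # (checksum row of A) * B
    observed = column_sums(c)                  # checksum row of C
    return all(abs(x - y) <= threshold for x, y in zip(expected, observed))

random.seed(0)
n = 8                # illustrative matrix size (assumption)
threshold = 1e-9     # illustrative threshold (assumption)
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]
c = matmul(a, b)

# Fault-free result: passes, because roundoff stays far below the threshold.
clean_ok = checksum_test(a, b, c, threshold)

# Large injected error in one element: the test fails, so it is detected.
big = [row[:] for row in c]
big[1][2] += 1.0
big_ok = checksum_test(a, b, big, threshold)

# Lowest mantissa bit changed in one element: the test still passes,
# so this error goes undetected.
low = [row[:] for row in c]
low[1][2] = math.nextafter(low[1][2], math.inf)  # requires Python 3.9+
low_ok = checksum_test(a, b, low, threshold)

print(clean_ok, big_ok, low_ok)  # True False True
```

Lowering the threshold far enough to catch the one-bit change would bring it into the range of ordinary roundoff differences between the two checksum computations, which is the false-alarm versus coverage trade-off the abstract refers to.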