MANTISSA-PRESERVING OPERATIONS AND ROBUST ALGORITHM-BASED FAULT-TOLERANCE FOR MATRIX COMPUTATIONS

Authors

DUTT S; ASSAAD FT

Citation

S. Dutt and F.T. Assaad, MANTISSA-PRESERVING OPERATIONS AND ROBUST ALGORITHM-BASED FAULT-TOLERANCE FOR MATRIX COMPUTATIONS, I.E.E.E. transactions on computers, 45(4), 1996, pp. 408-424

Citations number

Subject Categories

Computer Sciences; Engineering, Electrical & Electronic; Computer Science, Hardware & Architecture

Journal title

I.E.E.E. transactions on computers

ISSN journal

00189340

Volume

45

Issue

4

Year of publication

1996

Pages

408 - 424

Database

ISI

SICI code

0018-9340(1996)45:4<408:MOARAF>2.0.ZU;2-3

Abstract

A system-level method for achieving fault tolerance called algorithm-based fault tolerance (ABFT) has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding of the equality test has been commonly used to avoid false alarms; however, a good threshold that minimizes false alarms without reducing the error coverage significantly is difficult to find, especially when not much is known about the input data. Furthermore, thresholded checksums will inevitably miss lower-bit errors, which can get magnified as a computation such as LU decomposition progresses. Here we develop a theory for applying integer mantissa checksum tests to "mantissa-preserving" floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest of the computation. We apply this general methodology to matrix-matrix multiplication and LU decomposition (using the Gaussian elimination (GE) algorithm), and find that the accuracy of this new "hybrid" testing scheme is substantially higher than the floating-point test with thresholding, and also that its time overhead with respect to the floating-point test is nominal (15% and 9.5% on the average for matrix multiplication and LU decomposition, respectively). The hybrid test can also be easily applied to other computations like matrix inversion that use the GE algorithm.
We prove that the mantissa-based integer checksum test for both matrix multiplication and LU decomposition is able to detect at least three errors in the floating-point multiplication component of these computations. For LU decomposition, it is also able to correct a single error in the floating-point multiplies.
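
Illustrative sketch

The contrast drawn in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes a classic column-checksum test for C = A*B with a caller-chosen threshold, and a simplified integer test in which a list of values is multiplied by one common factor, as happens when a row is scaled. All function names here (`matmul`, `column_checksum_test`, `int_mantissa`, `mantissa_products`, `mantissa_checksum_ok`) are hypothetical and chosen for illustration only.

```python
import math


def matmul(a, b):
    """Plain floating-point matrix product of two lists of rows."""
    k, m = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(m)]
            for row in a]


def column_checksum_test(a, b, c, threshold):
    """Thresholded floating-point test: (column sums of A) * B vs column sums of C.

    Both sides are computed in floating point, so they usually differ by
    roundoff; the threshold decides how much difference is tolerated.
    """
    k, m = len(b), len(b[0])
    a_sums = [sum(row[t] for row in a) for t in range(k)]
    expected = [sum(a_sums[t] * b[t][j] for t in range(k)) for j in range(m)]
    actual = [sum(row[j] for row in c) for j in range(m)]
    return all(abs(e - s) <= threshold for e, s in zip(expected, actual))


def int_mantissa(x):
    """Split a finite double into (integer mantissa, exponent), x == m * 2**e."""
    frac, exp = math.frexp(x)          # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(frac * (1 << 53)), exp - 53


def mantissa_products(xs, y):
    """Exact integer products of each mantissa in xs with the mantissa of y."""
    my, _ = int_mantissa(y)
    return [int_mantissa(x)[0] * my for x in xs]


def mantissa_checksum_ok(products, xs, y):
    """Exact integer test: sum of products == (sum of mantissas) * mantissa of y.

    Integer arithmetic has no roundoff, so plain equality is used (assumption:
    a simplified stand-in for the paper's mantissa checksum).
    """
    my, _ = int_mantissa(y)
    return sum(products) == sum(int_mantissa(x)[0] for x in xs) * my


if __name__ == "__main__":
    a = [[0.1, 0.2], [0.3, 0.4]]
    b = [[0.5, 0.6], [0.7, 0.8]]
    c = matmul(a, b)
    c[0][0] += 1e-15                   # inject a low-order-bit error
    print("thresholded test passes:", column_checksum_test(a, b, c, 1e-9))

    xs, y = [0.1, -0.3, 2.5], 0.7
    prods = mantissa_products(xs, y)
    prods[1] ^= 1                      # flip the lowest bit of one product
    print("integer test passes:", mantissa_checksum_ok(prods, xs, y))
```

In this sketch the injected low-bit error is far below the threshold of the floating-point test, whereas the integer test compares exact values and so needs no threshold at all.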