Loop Transformations for Fault-Detection in Regular Loops on Massively-Parallel Systems

Authors

C. Gong, R. Melhem, R. Gupta

Citation

C. Gong et al., Loop Transformations for Fault-Detection in Regular Loops on Massively-Parallel Systems, IEEE Transactions on Parallel and Distributed Systems, 7(12), 1996, pp. 1238-1249

Citations number

Subject categories

System Science; Engineering, Electrical & Electronic; Computer Science Theory & Methods

Journal title

IEEE Transactions on Parallel and Distributed Systems

ISSN journal

1045-9219

Volume

7

Issue

12

Year of publication

1996

Pages

1238 - 1249

Database

ISI

SICI code

1045-9219(1996)7:12<1238:LTFFIR>2.0.ZU;2-O

Abstract

Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, as the number of processors in a system increases, fault detection and fault tolerance become critical issues. Errors can be detected by replicating a computation on more than one processor and comparing the results these processors produce. During the execution of a program, data dependencies typically prevent all processors in a multiprocessor system from being busy at all times. Processor schedules therefore contain idle time slots, and the goal of this work is to exploit these idle slots to schedule duplicated computation for fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. This approach achieves fault detection by duplicating the execution of statement instances. After careful analysis of the data dependencies of a regular loop, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor time to execute replicated statements whenever possible. Next, we present loop transformations that implement these fault-detection strategies. We also develop a general framework for selecting appropriate loop transformations. Experiments on the CRAY-T3D show that the overhead of adding the fault-detection capability is usually less than 25%, and less than 10% when communication overhead is reduced by grouping messages.
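
The duplicate-and-compare idea in the abstract can be illustrated with a minimal sketch. This is not the authors' compiler transformation: the recurrence a[i] = a[i-1] + b[i], the function name first_mismatch, and the fault_at parameter are all hypothetical, and the two "processors" are simulated sequentially in one process. It shows only the core check: each statement instance is executed twice and the two results are compared.

```python
from typing import List, Optional


def first_mismatch(b: List[int], fault_at: Optional[int] = None) -> Optional[int]:
    """Run the loop a[i] = a[i-1] + b[i] with every instance duplicated.

    'primary' and 'shadow' stand in for the same statement instance executed
    on two different processors (an assumption made for illustration).
    If fault_at is given, the primary result of that iteration is corrupted
    to simulate a fault. Returns the index of the first iteration whose two
    results disagree, or None if no disagreement is observed.
    """
    prev = 0  # a[-1], taken to be 0 in this sketch
    for i, bi in enumerate(b):
        primary = prev + bi       # original statement instance
        if i == fault_at:
            primary ^= 1          # simulated fault: flip the lowest bit
        shadow = prev + bi        # duplicated statement instance
        if primary != shadow:     # comparison step: a mismatch signals an error
            return i
        prev = primary
    return None


if __name__ == "__main__":
    print(first_mismatch([1, 2, 3, 4]))              # fault-free run
    print(first_mismatch([1, 2, 3, 4], fault_at=2))  # run with an injected fault
```

In the approach the abstract describes, the duplicated instances would be placed by the compiler into idle slots of other processors' schedules rather than run back to back as here; the sketch leaves scheduling out entirely.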