The occurrence of faults in multicomputers with hundreds or thousands of
nodes is a likely event that can be dealt with by hardware or software
fault-tolerant approaches. This paper presents a unifying model that
describes software reconfiguration strategies for parallel applications
with regular computational patterns. We show that most existing strategies
can be obtained as
instances of the proposed threshold-based reconfiguration meta-algorithm.
Moreover, this approach is useful for discovering several as yet unexplored
strategies, among which we consider the class of adaptive threshold-based
strategies. The performance optimization analysis demonstrates that these
strategies, applied to data-parallel regular computations, give optimal
results for worst-case fault patterns. A wide spectrum of simulations, in
which the system parameters have been set to those of actual
multicomputers, confirms that adaptive threshold-based strategies yield the
most stable performance for a variety of workloads, independently of the
number and pattern of faults.
(C) 1998 Academic Press.