S. Dutt et Nr. Mahapatra, NODE-COVERING, ERROR-CORRECTING CODES AND MULTIPROCESSORS WITH VERY HIGH AVERAGE FAULT-TOLERANCE, I.E.E.E. transactions on computers, 46(9), 1997, pp. 997-1015
Structural fault tolerance (SFT) is the ability of a multiprocessor to
reconfigure around faulty processors or links in order to preserve it
s original processor interconnection structure. In this paper, we focu
s on the design of SFT multiprocessors that have low switch and link o
verheads, but can tolerate a very large number of processor faults on
the average. Most previous work has concentrated on deterministic k-fa
ult-toterant (k-FT) designs in which exactly k spare processors and so
me spare switches and links are added to construct multiprocessors tha
t can tolerate any k processor faults. However, after k faults are rec
onfigured around, much of the extra links and switches can remain unut
ilized. It is possible within the basic node-covering framework, which
was introduced by Dutt and Hayes as an efficient k-FT design method,
to design FT multiprocessors that have the same amount of switches and
[inks as, say, a two-FT deterministic design, but have s spare proces
sors, where s much greater than 2, so that, on the average, k = theta(
s) (k less than or equal to s) processor failures can be reconfigured
around. Such designs utilize the spare link and switch capacity very e
fficiently, and are called probabilistic FT designs. An elegant and po
werful method to construct covering graphs or CG's, which are key to o
btaining the probabilistic FT designs, is to use linear error-correcti
ng codes (ECCs). We show how to construct probabilistic designs with v
ery high average fault tolerance but low wiring and switch overhead us
ing ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes
. This design methodology is applicable to any multiprocessor intercon
nection topology and the resulting FT designs have the same node degre
e as the non-FT target topology. We also analyze the deterministic fau
lt tolerance for these designs and develop efficient layout strategies
for them. Finally, we compare the proposed probabilistic designs to s
ome of the best deterministic and probabilistic designs proposed in th
e past, and show that our designs can meet a given mean-time-to-failur
e (MTTF) specification at much lower hardware costs (switch complexity
, redundant wiring area, and spare-processor overhead) than previous d
esigns. Further, for a given number of spare processors, our designs h
ave close-to-optimal reconfigurabilities that are much better than tho
se of previous probabilistic designs.