NODE-COVERING, ERROR-CORRECTING CODES AND MULTIPROCESSORS WITH VERY HIGH AVERAGE FAULT-TOLERANCE

Citation
S. Dutt et Nr. Mahapatra, NODE-COVERING, ERROR-CORRECTING CODES AND MULTIPROCESSORS WITH VERY HIGH AVERAGE FAULT-TOLERANCE, I.E.E.E. transactions on computers, 46(9), 1997, pp. 997-1015
Citations number
29
Categorie Soggetti
Computer Sciences","Engineering, Eletrical & Electronic","Computer Science Hardware & Architecture
ISSN journal
00189340
Volume
46
Issue
9
Year of publication
1997
Pages
997 - 1015
Database
ISI
SICI code
0018-9340(1997)46:9<997:NECAMW>2.0.ZU;2-F
Abstract
Structural fault tolerance (SFT) is the ability of a multiprocessor to reconfigure around faulty processors or links in order to preserve it s original processor interconnection structure. In this paper, we focu s on the design of SFT multiprocessors that have low switch and link o verheads, but can tolerate a very large number of processor faults on the average. Most previous work has concentrated on deterministic k-fa ult-toterant (k-FT) designs in which exactly k spare processors and so me spare switches and links are added to construct multiprocessors tha t can tolerate any k processor faults. However, after k faults are rec onfigured around, much of the extra links and switches can remain unut ilized. It is possible within the basic node-covering framework, which was introduced by Dutt and Hayes as an efficient k-FT design method, to design FT multiprocessors that have the same amount of switches and [inks as, say, a two-FT deterministic design, but have s spare proces sors, where s much greater than 2, so that, on the average, k = theta( s) (k less than or equal to s) processor failures can be reconfigured around. Such designs utilize the spare link and switch capacity very e fficiently, and are called probabilistic FT designs. An elegant and po werful method to construct covering graphs or CG's, which are key to o btaining the probabilistic FT designs, is to use linear error-correcti ng codes (ECCs). We show how to construct probabilistic designs with v ery high average fault tolerance but low wiring and switch overhead us ing ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes . This design methodology is applicable to any multiprocessor intercon nection topology and the resulting FT designs have the same node degre e as the non-FT target topology. We also analyze the deterministic fau lt tolerance for these designs and develop efficient layout strategies for them. Finally, we compare the proposed probabilistic designs to s ome of the best deterministic and probabilistic designs proposed in th e past, and show that our designs can meet a given mean-time-to-failur e (MTTF) specification at much lower hardware costs (switch complexity , redundant wiring area, and spare-processor overhead) than previous d esigns. Further, for a given number of spare processors, our designs h ave close-to-optimal reconfigurabilities that are much better than tho se of previous probabilistic designs.