ITA
ENG

NODE-COVERING, ERROR-CORRECTING CODES AND MULTIPROCESSORS WITH VERY HIGH AVERAGE FAULT-TOLERANCE

Authors

DUTT S MAHAPATRA NR

Citation

S. Dutt et Nr. Mahapatra, NODE-COVERING, ERROR-CORRECTING CODES AND MULTIPROCESSORS WITH VERY HIGH AVERAGE FAULT-TOLERANCE, I.E.E.E. transactions on computers, 46(9), 1997, pp. 997-1015

Citations number

Categorie Soggetti

Computer Sciences","Engineering, Eletrical & Electronic","Computer Science Hardware & Architecture

Journal title

I.E.E.E. transactions on computers → ACNP

ISSN journal

00189340

Volume

Issue

Year of publication

1997

Pages

997 - 1015

Database

ISI

SICI code

0018-9340(1997)46:9<997:NECAMW>2.0.ZU;2-F

Abstract

Structural fault tolerance (SFT) is the ability of a multiprocessor to reconfigure around faulty processors or links in order to preserve it s original processor interconnection structure. In this paper, we focu s on the design of SFT multiprocessors that have low switch and link o verheads, but can tolerate a very large number of processor faults on the average. Most previous work has concentrated on deterministic k-fa ult-toterant (k-FT) designs in which exactly k spare processors and so me spare switches and links are added to construct multiprocessors tha t can tolerate any k processor faults. However, after k faults are rec onfigured around, much of the extra links and switches can remain unut ilized. It is possible within the basic node-covering framework, which was introduced by Dutt and Hayes as an efficient k-FT design method, to design FT multiprocessors that have the same amount of switches and [inks as, say, a two-FT deterministic design, but have s spare proces sors, where s much greater than 2, so that, on the average, k = theta( s) (k less than or equal to s) processor failures can be reconfigured around. Such designs utilize the spare link and switch capacity very e fficiently, and are called probabilistic FT designs. An elegant and po werful method to construct covering graphs or CG's, which are key to o btaining the probabilistic FT designs, is to use linear error-correcti ng codes (ECCs). We show how to construct probabilistic designs with v ery high average fault tolerance but low wiring and switch overhead us ing ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes . This design methodology is applicable to any multiprocessor intercon nection topology and the resulting FT designs have the same node degre e as the non-FT target topology. We also analyze the deterministic fau lt tolerance for these designs and develop efficient layout strategies for them. Finally, we compare the proposed probabilistic designs to s ome of the best deterministic and probabilistic designs proposed in th e past, and show that our designs can meet a given mean-time-to-failur e (MTTF) specification at much lower hardware costs (switch complexity , redundant wiring area, and spare-processor overhead) than previous d esigns. Further, for a given number of spare processors, our designs h ave close-to-optimal reconfigurabilities that are much better than tho se of previous probabilistic designs.