GUARDED REPAIR OF DEPENDABLE SYSTEMS

Citation
H. Demeer et al., GUARDED REPAIR OF DEPENDABLE SYSTEMS, Theoretical Computer Science, 128(1-2), 1994, pp. 179-210
Number of citations
19
Subject Categories
Computer Sciences; Computer Science Theory & Methods
ISSN journal
0304-3975
Volume
128
Issue
1-2
Year of publication
1994
Pages
179 - 210
Database
ISI
SICI code
0304-3975(1994)128:1-2<179:GRODS>2.0.ZU;2-Y
Abstract
Imperfect coverage and nonnegligible reconfiguration delay are known to have a deleterious effect on the dependability and the performance of a multiprocessor system. In particular, increasing the number of processor elements does not always increase dependability. An obvious reason for this is that the total failure rate increases, generally, linearly with the number of components in the system. It is also a well-known fact that the performance gain due to parallelism mostly turns out to be sublinear with the number of processors. It is therefore important to optimize the degree of parallelism in system design. A related issue is that by deferring repair, it is sometimes possible to improve system dependability. In this case decisions have to be made dynamically as to when to repair and when not to repair. Most of the current research deals with static optimization of the number of processors. No systematic approach for dynamic control of dependable systems has been proposed so far. Dynamic, i.e. transient, decision of whether or not to repair is the optimization problem considered in this paper. We propose extended Markov reward models (EMRM) to capture such questions. EMRM are a marriage between performability modeling techniques and Markov decision theory. A numerical solution procedure is developed to provide optimal solution trajectories for this problem. EMRM are a general framework for the dynamic optimization of reconfigurable, dependable systems. The optimization is applied on the basis of several performance and dependability measures. In particular, we explore availability, capacity-oriented availability, performance-oriented unavailability, and performability measures. Furthermore, off-line and on-line repair strategies are compared. We show that guarded repair can improve system performance and dependability significantly. The control strategies and reward functions differ a lot in each case. Each scenario turns out to be of interest in its own right. A time-dependent optimality of dependable, parallel configurations can be determined from our results.
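Illustrative note
The abstract's remark that adding processor elements does not always increase dependability under imperfect coverage can be made concrete with a small sketch. This is not the paper's EMRM method. It assumes a much simpler textbook-style model: n identical processors with independent exponential lifetimes, no repair, a system that is up while at least one processor works, and each failure being "covered" (handled without bringing the system down) independently with a fixed probability. The function names and parameters (`reliability`, `best_processor_count`, `failure_rate`, `coverage`, `max_n`) are hypothetical and chosen only for illustration.

```python
import math


def reliability(n, t, failure_rate, coverage):
    """Probability that the system is still up at time t.

    Assumed model (not taken from the paper): n processors fail
    independently at rate `failure_rate`, there is no repair, each
    failure is covered with probability `coverage`, and one uncovered
    failure brings the whole system down.
    """
    p_up = math.exp(-failure_rate * t)  # a given processor is still working
    q_down = 1.0 - p_up                 # a given processor has failed
    # Every processor is either up or has failed in a covered way,
    # minus the case in which all n processors have failed.
    return (p_up + coverage * q_down) ** n - (coverage * q_down) ** n


def best_processor_count(t, failure_rate, coverage, max_n=64):
    """Number of processors in 1..max_n that maximizes reliability at time t."""
    return max(
        range(1, max_n + 1),
        key=lambda n: reliability(n, t, failure_rate, coverage),
    )


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, round(reliability(n, t=1.0, failure_rate=1.0, coverage=0.9), 4))
```

Under these assumptions, reliability with perfect coverage grows with every added processor, whereas with coverage below 1 it peaks at a finite n and then declines, because each extra processor adds another chance of an uncovered failure. This is a static optimization of the kind the abstract contrasts with its dynamic repair decisions.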