USING MULTISTAGE AND STRATIFIED SAMPLING FOR INFERRING FAULT-COVERAGE PROBABILITIES

Citation
C. Constantinescu, USING MULTISTAGE AND STRATIFIED SAMPLING FOR INFERRING FAULT-COVERAGE PROBABILITIES, IEEE Transactions on Reliability, 44(4), 1995, pp. 632-639
Number of citations
29
Subject Categories
Computer Sciences; Engineering, Electrical & Electronic; Computer Science Hardware & Architecture; Computer Science Software Graphics Programming
ISSN journal
00189529
Volume
44
Issue
4
Year of publication
1995
Pages
632 - 639
Database
ISI
SICI code
0018-9529(1995)44:4<632:UMASSF>2.0.ZU;2-B
Abstract
Development of fault-tolerant computing systems requires accurate reliability modeling. Analytic, simulation, and hybrid models are commonly used for obtaining reliability measures. These measures are functions of component failure rates and fault-coverage (probabilities). Coverage provides information about the fault & error detection, isolation, and system recovery capabilities. This parameter can be derived by physical or simulated fault injection. Unfortunately, the complexity of modern computing systems makes exhaustive testing intractable. As a consequence, statistical inference has been used to extract meaningful information from sample observations. The problem of conducting fault injection experiments and statistically inferring the coverage from the information gathered in those experiments is addressed in this paper. The methods previously used for estimating the coverage considered only a few factors which influence the coverage. By contrast, we perform statistical experiments in a multi-dimensional space of events. In this way all major factors which influence the coverage (fault locations, timing characteristics of the fault, and the workload) are accounted for. For process control computers, the combination of input values and fault occurrence times provides information about the workload which is executed. Multi-stage, stratified, and combined multi-stage & stratified sampling are used in this paper for deriving the coverage. Equations of the mean, variance, and confidence interval of the coverage are provided. The statistical error produced by the injected faults which do not induce errors in the tested system (also known as the nonresponse problem) is considered. A program which emulates a typical fault environment was developed and four hypothetical systems are analyzed. These systems are characterized by coverages in the 0.90 - 0.9999 range and a 10^12 fault space.
The confidence intervals of the coverage are derived and checked against known true values. The main advantages of this approach are as follows. Fault injection is performed in a multidimensional space of events and accounts for all major factors which affect the coverage: fault location, timing characteristics of the fault, and system workload. Randomness, which characterizes the fault occurrence and error propagation in a real computer, is preserved throughout the fault injection experiment. Coverage estimators are provided in a general form; thus the same equations can be used for various applications by choosing the proper number of stages and strata. The method applies to both physical & simulated fault injection. The assumption of normality is the main limitation of the method. However, experiments performed for various dimensions of the fault space and values of the coverage, and reported by some researchers, have confirmed the adequacy of this assumption.
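
As a rough illustration of the kind of estimate the abstract describes, the sketch below computes a stratified-sampling coverage estimate with a normal-approximation confidence interval. It uses the standard textbook stratified estimator for a proportion, not necessarily the equations of the paper, which are not reproduced in this record. The function name stratified_coverage, its parameters, and the idea that each stratum's weight is its share of the fault space are assumptions made for illustration only; the finite-population correction is ignored, which seems reasonable for a very large fault space.

```python
import math
from statistics import NormalDist


def stratified_coverage(weights, injected, covered, confidence=0.95):
    """Textbook stratified estimate of a coverage probability (illustrative).

    weights  -- assumed share of the fault space in each stratum (sums to 1)
    injected -- number of faults injected (sampled) in each stratum
    covered  -- number of injected faults that were covered in each stratum
    Returns (estimate, variance, (lower, upper)).
    """
    if not (len(weights) == len(injected) == len(covered)):
        raise ValueError("all inputs must have one entry per stratum")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("stratum weights must sum to 1")

    estimate = 0.0
    variance = 0.0
    for w, n, x in zip(weights, injected, covered):
        if n < 2 or not 0 <= x <= n:
            raise ValueError("need n >= 2 and 0 <= covered <= injected")
        p = x / n                                    # per-stratum coverage
        estimate += w * p                            # weighted mean
        variance += w * w * p * (1.0 - p) / (n - 1)  # unbiased variance term

    # Normality assumption: two-sided interval from the standard normal quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * math.sqrt(variance)
    lower = max(0.0, estimate - half_width)
    upper = min(1.0, estimate + half_width)
    return estimate, variance, (lower, upper)


if __name__ == "__main__":
    # Hypothetical numbers, not data from the paper.
    est, var, (lo, hi) = stratified_coverage(
        weights=[0.5, 0.5], injected=[100, 100], covered=[90, 100]
    )
    print(f"coverage = {est:.4f}, variance = {var:.6f}, CI = [{lo:.4f}, {hi:.4f}]")
```

A stratum in which every injected fault is covered contributes zero to the estimated variance, so the interval can look narrower than it should when coverage is very close to 1. This is one place where the normality assumption mentioned in the abstract is most strained.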