C. Constantinescu, USING MULTISTAGE AND STRATIFIED SAMPLING FOR INFERRING FAULT-COVERAGE PROBABILITIES, IEEE Transactions on Reliability, 44(4), 1995, pp. 632-639
Development of fault-tolerant computing systems requires accurate
reliability modeling. Analytic, simulation, and hybrid models are
commonly used for obtaining reliability measures. These measures are
functions of component failure rates and fault-coverage (probabilities).
Coverage provides information about the fault & error detection,
isolation, and system recovery capabilities. This parameter can be
derived by physical or simulated fault injection. Unfortunately, the
complexity of modern computing systems makes exhaustive testing
intractable. As a consequence, statistical inference has been used to
extract meaningful information from sample observation. The problem of
conducting fault injection experiments and statistically inferring the
coverage from the information gathered in those experiments is
addressed in this paper.
The methods previously used for estimating the coverage considered only
a few factors which influence the coverage. By contrast, we perform
statistical experiments in a multi-dimensional space of events. In this
way all major factors which influence the coverage (fault locations,
timing characteristics of the fault, and the workload) are accounted
for. For process control computers, the combination of input values and
fault occurrence times provides information about the workload which
is executed. Multi-stage, stratified, and combined multi-stage &
stratified sampling are used in this paper for deriving the coverage.
Equations of the mean, variance, and confidence interval of the coverage
are provided. The statistical error produced by the injected faults
which do not induce errors in the tested system (also known as the
nonresponse problem) is considered. A program which emulates a typical
fault environment was developed and four hypothetical systems are
analyzed. These systems are characterized by coverages in the 0.90 -
0.9999 range and a 10^12 fault space. The confidence intervals of the
coverage
are derived and checked against known true values. The main advantages
of this approach are: (1) fault injection is performed in a
multidimensional space of events, and accounts for all major factors
which affect the coverage: fault location, timing characteristics of
the fault, and system workload; (2) the randomness 'which characterizes
the fault occurrence and error propagation in a real computer' is
preserved throughout the fault injection experiment; (3) coverage
estimators are provided in a general form, so the same equations can be
used for various applications by choosing the proper number of stages
and strata; (4) the method applies to both physical & simulated fault
injection. The assumption of normality is the main limitation of the
method. However, experiments performed for various dimensions of the
fault space and values of the coverage, and reported by some
researchers, have confirmed the adequacy of this assumption.
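
The abstract does not reproduce the paper's equations, so the following
is only a minimal sketch of the general idea, using the standard textbook
stratified estimator of a proportion together with the normality
assumption mentioned above. The function name, the (weight, injected,
covered) input layout, and the example counts are hypothetical and are
not taken from the paper; the finite-population correction is omitted on
the assumption that the fault space is far larger than the sample.

```python
from math import sqrt
from statistics import NormalDist


def stratified_coverage(strata, confidence=0.95):
    """Estimate coverage from per-stratum fault injection counts.

    strata: list of (weight, injected, covered) tuples, where weight is
    the stratum's share of the fault space (weights must sum to 1),
    injected is the number of faults injected in that stratum, and
    covered is how many of them were handled correctly.
    Returns (estimate, variance, (lower, upper)).
    """
    total_weight = sum(weight for weight, _, _ in strata)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("stratum weights must sum to 1")

    estimate = 0.0
    variance = 0.0
    for weight, injected, covered in strata:
        if injected < 2 or not 0 <= covered <= injected:
            raise ValueError("need injected >= 2 and 0 <= covered <= injected")
        p = covered / injected  # per-stratum coverage estimate
        estimate += weight * p
        # Unbiased variance estimate of a sample proportion, weighted by
        # the squared stratum share (no finite-population correction).
        variance += weight ** 2 * p * (1.0 - p) / (injected - 1)

    # Two-sided interval under the normality assumption.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sqrt(variance)
    lower = max(0.0, estimate - half_width)
    upper = min(1.0, estimate + half_width)
    return estimate, variance, (lower, upper)


if __name__ == "__main__":
    # Hypothetical experiment: three strata of the fault space.
    example = [(0.5, 1000, 990), (0.3, 1000, 950), (0.2, 1000, 900)]
    est, var, (lo, hi) = stratified_coverage(example)
    print(f"coverage = {est:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

For coverages very close to 1, the normal approximation can become
inaccurate at small sample sizes, which is the limitation the abstract
itself points out.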