This paper demonstrates a methodology for evaluating the fault-tolerance
characteristics of operational software and illustrates it through
case studies of three operating systems: the Tandem GUARDIAN fault-tolerant
system, the VAX/VMS distributed system, and the IBM/MVS system. Based on
measurements from these systems, software-error characteristics are investigated
via the analysis of error distributions and correlations. Two levels
of models are developed to analyze the error and recovery processes inside
an operating system and the interactions among multiple copies of an
operating system running in a distributed environment. Reward analysis
is used to evaluate the loss of service due to software errors and
the effect of fault-tolerance techniques implemented in the systems.
Our conclusions follow. Software errors tend to occur in bursts on both
IBM and VAX machines. This tendency is less pronounced in the Tandem system,
which can be attributed to its fault-tolerant design. The Tandem system's
fault tolerance reduces the service loss due to software failures by a
factor of 10. Recovery routines in the IBM/MVS system are effective in
that they prevent system failures under most software-error conditions.
Approximately 10% of software failures in the VAXcluster and 20% in the
Tandem system occur concurrently on multiple machines. A multicomputer
software time-to-error distribution can be modeled by a 2-phase
hyperexponential random variable: a lower error rate that captures
regular errors, and a higher error rate that captures error bursts
and concurrent errors on multiple machines.
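
As an illustration of the 2-phase hyperexponential model named above, the following minimal Python sketch mixes two exponential phases: a low-rate phase for regular errors and a high-rate phase for bursts and concurrent errors. The function names, the mixing probability p, and the numeric rates are hypothetical choices made for this example; they are not values reported in the paper.

```python
import math
import random


def hyperexp2_pdf(t, p, rate_low, rate_high):
    """Density of a 2-phase hyperexponential time to error at time t.

    With probability p the time is drawn from the low-rate phase
    (regular errors); with probability 1 - p it is drawn from the
    high-rate phase (bursts and concurrent errors).
    """
    if t < 0:
        return 0.0
    return (p * rate_low * math.exp(-rate_low * t)
            + (1.0 - p) * rate_high * math.exp(-rate_high * t))


def hyperexp2_mean(p, rate_low, rate_high):
    """Mean time to error: the probability-weighted mean of both phases."""
    return p / rate_low + (1.0 - p) / rate_high


def hyperexp2_sample(p, rate_low, rate_high, rng):
    """Draw one time to error by first picking a phase, then sampling it."""
    rate = rate_low if rng.random() < p else rate_high
    return rng.expovariate(rate)


# Illustrative parameters only (assumed, not taken from the measurements):
# 70% of errors are "regular" at 0.01 per hour, 30% are "burst" at 1 per hour.
P, RATE_LOW, RATE_HIGH = 0.7, 0.01, 1.0

rng = random.Random(12345)
samples = [hyperexp2_sample(P, RATE_LOW, RATE_HIGH, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
analytic_mean = hyperexp2_mean(P, RATE_LOW, RATE_HIGH)
```

With these assumed parameters the empirical mean of the samples should lie close to the analytic mean. A hyperexponential mixture always has a coefficient of variation of at least 1, which is why it suits data in which short inter-error times (bursts) coexist with long quiet periods.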