MEASUREMENT-BASED EVALUATION OF OPERATING SYSTEM FAULT-TOLERANCE

Citation
I. Lee et al., MEASUREMENT-BASED EVALUATION OF OPERATING SYSTEM FAULT-TOLERANCE, IEEE Transactions on Reliability, 42(2), 1993, pp. 238-249
Number of citations
30
Subject Categories
Operations Research & Management Science; Statistics & Probability; Engineering; Engineering, Electrical & Electronic; Computer Applications & Cybernetics
ISSN journal
00189529
Volume
42
Issue
2
Year of publication
1993
Pages
238 - 249
Database
ISI
SICI code
0018-9529(1993)42:2<238:MEOOSF>2.0.ZU;2-C
Abstract
This paper demonstrates a methodology for evaluating the fault-tolerance characteristics of operational software, and illustrates it through case studies of 3 operating systems: Tandem GUARDIAN fault-tolerant system, VAX/VMS distributed system, IBM/MVS system. Based on measurements from these systems, software-error characteristics are investigated via the analysis of error distributions and correlations. Two levels of models are developed to analyze the error & recovery processes inside an operating system and the interactions among multiple copies of an operating system running in a distributed environment. Reward analysis is used to evaluate the loss of service due to software errors and the effect of fault-tolerance techniques implemented in the systems. Our conclusions follow. Software errors tend to occur in bursts on both IBM & VAX machines. This is less pronounced in the Tandem system, which can be attributed to its fault-tolerant design. The Tandem-system fault-tolerance reduces the service loss due to software failures by a factor of 10. Recovery routines in the IBM/MVS system are effective in that they prevent system failures under most software-error conditions. For software failures, approximately 10% from the VAXcluster and 20% from the Tandem system occur concurrently on multiple machines. A multicomputer software Time To Error distribution can be modeled by a 2-phase hyperexponential random variable: A lower error rate which captures regular errors, and a higher error rate which captures error bursts and concurrent errors on multiple machines.
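Illustration
The 2-phase hyperexponential Time To Error model named in the abstract can be sketched as a mixture of two exponential phases. The sketch below is only an illustration of that distribution family, not the authors' fitting procedure. The function names and the parameter values (p_burst, rate_burst, rate_regular) are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_time_to_error(p_burst, rate_burst, rate_regular, rng):
    """Draw one time-to-error from a 2-phase hyperexponential distribution.

    With probability p_burst the higher-rate phase (bursts, concurrent
    errors) is chosen; otherwise the lower-rate phase (regular errors).
    All parameter values used with this function are hypothetical.
    """
    rate = rate_burst if rng.random() < p_burst else rate_regular
    return rng.expovariate(rate)


def hyperexp_mean(p_burst, rate_burst, rate_regular):
    """Analytic mean: mixture of the two exponential means."""
    return p_burst / rate_burst + (1.0 - p_burst) / rate_regular


def hyperexp_cdf(t, p_burst, rate_burst, rate_regular):
    """Analytic CDF: P(T <= t) for the mixture."""
    return (p_burst * (1.0 - math.exp(-rate_burst * t))
            + (1.0 - p_burst) * (1.0 - math.exp(-rate_regular * t)))


def squared_cv(p_burst, rate_burst, rate_regular):
    """Squared coefficient of variation (variance / mean^2)."""
    mean = hyperexp_mean(p_burst, rate_burst, rate_regular)
    second_moment = 2.0 * (p_burst / rate_burst ** 2
                           + (1.0 - p_burst) / rate_regular ** 2)
    return (second_moment - mean ** 2) / mean ** 2


if __name__ == "__main__":
    # Hypothetical parameters: 30% of errors belong to the burst phase.
    p, fast, slow = 0.3, 10.0, 0.1
    rng = random.Random(0)
    draws = [sample_time_to_error(p, fast, slow, rng) for _ in range(100_000)]
    print("sample mean  :", sum(draws) / len(draws))
    print("analytic mean:", hyperexp_mean(p, fast, slow))
    print("squared CV   :", squared_cv(p, fast, slow))
```

When the two rates differ, the squared coefficient of variation exceeds 1, which is the property that lets this family represent bursty error arrivals better than a single exponential.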