MEASUREMENT-BASED EVALUATION OF OPERATING SYSTEM FAULT-TOLERANCE

Authors

LEE I; TANG D; IYER RK; HSUEH MC

Citation

I. Lee et al., MEASUREMENT-BASED EVALUATION OF OPERATING SYSTEM FAULT-TOLERANCE, IEEE transactions on reliability, 42(2), 1993, pp. 238-249

Citations number

Subject Categories

Operations Research & Management Science; Statistics & Probability; Engineering; Engineering, Electrical & Electronic; Computer Applications & Cybernetics

Journal title

IEEE transactions on reliability

ISSN journal

0018-9529

Volume

42

Issue

2

Year of publication

1993

Pages

238 - 249

Database

ISI

SICI code

0018-9529(1993)42:2<238:MEOOSF>2.0.ZU;2-C

Abstract

This paper demonstrates a methodology for evaluating the fault-tolerance characteristics of operational software, and illustrates it through case studies of 3 operating systems: Tandem GUARDIAN fault-tolerant system, VAX/VMS distributed system, IBM/MVS system. Based on measurements from these systems, software-error characteristics are investigated via the analysis of error distributions and correlations. Two levels of models are developed to analyze the error & recovery processes inside an operating system and the interactions among multiple copies of an operating system running in a distributed environment. Reward analysis is used to evaluate the loss of service due to software errors and the effect of fault-tolerance techniques implemented in the systems. Our conclusions follow. Software errors tend to occur in bursts on both IBM & VAX machines. This is less pronounced in the Tandem system, which can be attributed to its fault-tolerant design. The Tandem-system fault-tolerance reduces the service loss due to software failures by a factor of 10. Recovery routines in the IBM/MVS system are effective in that they prevent system failures under most software-error conditions. For software failures, approximately 10% from the VAXcluster and 20% from the Tandem system occur concurrently on multiple machines. A multicomputer software Time To Error distribution can be modeled by a 2-phase hyperexponential random variable: A lower error rate which captures regular errors, and a higher error rate which captures error bursts and concurrent errors on multiple machines.
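
The 2-phase hyperexponential model named in the last sentence of the abstract can be illustrated with a small sketch. It is not taken from the paper: the mixing probability and the two error rates below are hypothetical placeholders, not fitted values, and the function names are chosen only for this illustration. With probability p the time to error is drawn from an exponential distribution with the lower rate (regular errors); otherwise it is drawn with the higher rate (bursts and concurrent errors).

```python
import math
import random


def hyperexp_cdf(t, p, rate_low, rate_high):
    """P(T <= t) for a 2-phase hyperexponential time to error."""
    if t < 0:
        return 0.0
    return (1.0
            - p * math.exp(-rate_low * t)
            - (1.0 - p) * math.exp(-rate_high * t))


def hyperexp_mean(p, rate_low, rate_high):
    """Mean time to error: the mixture of the two exponential means."""
    return p / rate_low + (1.0 - p) / rate_high


def sample_time_to_error(p, rate_low, rate_high, rng):
    """Pick a phase with probability p, then draw an exponential time."""
    rate = rate_low if rng.random() < p else rate_high
    return rng.expovariate(rate)


# Hypothetical parameters (assumptions for illustration, not from the paper).
P_REGULAR = 0.7    # probability that an error belongs to the regular phase
RATE_LOW = 0.01    # errors per hour in the regular phase
RATE_HIGH = 2.0    # errors per hour in the burst/concurrent phase

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_time_to_error(P_REGULAR, RATE_LOW, RATE_HIGH, rng)
           for _ in range(50_000)]
empirical_mean = sum(samples) / len(samples)
analytic_mean = hyperexp_mean(P_REGULAR, RATE_LOW, RATE_HIGH)
```

Comparing `empirical_mean` with `analytic_mean` is a quick sanity check on the sampler. In a measurement study the three parameters would instead be estimated from the recorded inter-error times.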