Proactive management of software aging

Authors

Castelli, V; Harper, RE; Heidelberger, P; Hunter, SW; Trivedi, KS; Vaidyanathan, K; Zeggert, WP

Citation

V. Castelli et al., Proactive management of software aging, IBM J RES, 45(2), 2001, pp. 311-332

Citations number

Subject categories

Multidisciplinary, Computer Science & Engineering

Journal title

IBM JOURNAL OF RESEARCH AND DEVELOPMENT

ISSN journal

0018-8646

Volume

45

Issue

2

Year of publication

2001

Pages

311 - 332

Database

ISI

SICI code

0018-8646(200103)45:2<311:PMOSA>2.0.ZU;2-N

Abstract

Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to "software aging" as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. "Software rejuvenation" is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.