Proactive management of software aging

Citation
V. Castelli et al., Proactive management of software aging, IBM J RES, 45(2), 2001, pp. 311-332
Number of citations
37
Subject Categories
Multidisciplinary, Computer Science & Engineering
Journal title
IBM JOURNAL OF RESEARCH AND DEVELOPMENT
ISSN journal
0018-8646
Volume
45
Issue
2
Year of publication
2001
Pages
311 - 332
Database
ISI
SICI code
0018-8646(200103)45:2<311:PMOSA>2.0.ZU;2-N
Abstract
Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to "software aging" as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. "Software rejuvenation" is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.
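
Illustrative sketch (not taken from the paper)
The abstract mentions estimating the time remaining until resource exhaustion reaches a critical level and then rejuvenating proactively. The abstract does not say which estimator the authors use, so the following is only a minimal sketch that assumes a plain least-squares linear trend over recent samples of a free-resource metric. The function names, the `critical` threshold, and the `lead_time` parameter are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values against times; returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching (time, value) samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_exhaustion(times: Sequence[float], free: Sequence[float],
                       critical: float) -> Optional[float]:
    """Estimate time from the last sample until `free` drops to `critical`.

    Assumption: the resource drains roughly linearly. Returns None when the
    fitted trend is flat or rising, i.e. no exhaustion is predicted.
    """
    slope, intercept = fit_line(times, free)
    if slope >= 0:
        return None
    t_cross = (critical - intercept) / slope  # time at which the fitted line hits the threshold
    return max(0.0, t_cross - times[-1])


def should_rejuvenate(times: Sequence[float], free: Sequence[float],
                      critical: float, lead_time: float) -> bool:
    """Hypothetical prediction-based policy: rejuvenate once the estimated
    remaining time falls within the desired advance-notice window."""
    remaining = time_to_exhaustion(times, free, critical)
    return remaining is not None and remaining <= lead_time


if __name__ == "__main__":
    hours = [0, 1, 2, 3, 4]
    free_mb = [1000, 900, 800, 700, 600]  # made-up samples for illustration
    print(time_to_exhaustion(hours, free_mb, critical=100))
    print(should_rejuvenate(hours, free_mb, critical=100, lead_time=6))
```

A real monitor would likely need a more robust trend estimator and a confidence bound on the prediction; the linear fit here only illustrates the detect, estimate, and act sequence described in the abstract.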