Software failures are now known to be a dominant source of system outages.
Several studies and much anecdotal evidence point to "software aging" as a
common phenomenon, in which the state of a software system degrades with
time. Exhaustion of system resources, data corruption, and numerical error
accumulation are the primary symptoms of this degradation, which may
eventually lead to performance degradation of the software, crash/hang failure, or
other undesirable effects. "Software rejuvenation" is a proactive technique
intended to reduce the probability of future unplanned outages due to aging.
The basic idea is to pause or halt the running software, refresh its
internal state, and resume or restart it. Software rejuvenation can be
performed by relying on a variety of indicators of aging, or on the time
elapsed since the last rejuvenation. In response to the strong desire of
customers to be provided with advance notice of unplanned outages, our group
has developed techniques that detect the occurrence of software aging due to
resource exhaustion, estimate the time remaining until the exhaustion reaches
a critical level, and automatically perform proactive software rejuvenation of
an application, process group, or entire operating system, depending on the
pervasiveness of the resource exhaustion and our ability to pinpoint the
source. This technology has been incorporated into the IBM Director for
xSeries servers. To quantitatively evaluate the impact of different
rejuvenation policies on the availability of cluster systems, we have
developed analytical models based on stochastic reward nets (SRNs). For
time-based rejuvenation policies, we determined the optimal rejuvenation
interval based on system availability and cost. We also analyzed a
rejuvenation policy based on
prediction, and showed that it can further increase system availability and
reduce downtime cost. These models are very general and can capture a
multitude of cluster system characteristics, failure behavior, and
performability measures, which we are just beginning to explore.
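
The prediction step described above (estimating the time remaining until
resource exhaustion reaches a critical level) can be illustrated with a minimal
sketch. The abstract does not say which estimator is used, so this example
assumes an ordinary least-squares linear trend over sampled resource usage.
The names estimate_time_to_exhaustion, should_rejuvenate, critical_level, and
horizon are hypothetical and are not taken from IBM Director.

```python
from typing import Optional, Sequence


def estimate_time_to_exhaustion(
    times: Sequence[float],
    usage: Sequence[float],
    critical_level: float,
) -> Optional[float]:
    """Estimate the time left, measured from the last sample, until usage
    reaches critical_level. Assumes a linear trend (an illustrative choice,
    not the method from the source). Returns None if usage is not growing."""
    n = len(times)
    if n != len(usage) or n < 2:
        raise ValueError("need at least two (time, usage) samples")

    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")

    # Ordinary least-squares fit: usage ~ intercept + slope * time.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage)) / sxx
    intercept = mean_u - slope * mean_t

    if slope <= 0:
        return None  # no upward trend, so no exhaustion is predicted

    t_critical = (critical_level - intercept) / slope
    return max(0.0, t_critical - times[-1])


def should_rejuvenate(remaining: Optional[float], horizon: float) -> bool:
    """Hypothetical policy: rejuvenate once the predicted time to
    exhaustion falls within the given horizon."""
    return remaining is not None and remaining <= horizon
```

For example, with usage samples of 10, 20, 30, 40, 50 taken at times 0 through
4 and a critical level of 100, the fitted trend crosses the critical level at
time 9, so the estimate is 5 time units after the last sample. A real detector
would likely need a trend estimator that is robust to noise and seasonality in
the measurements.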