Software fault tolerance is the task of detecting and recovering from
failures that are not handled in the underlying hardware or operating
system layers of an application. Software rejuvenation prevents
failures by periodically, and gracefully, terminating an application
and restarting it at a clean internal state. This paper describes five
reusable software components that provide these capabilities. They
perform automatic detection and restart of failed processes,
checkpointing and
recovery of data in memory, replication and synchronization of files,
and software rejuvenation. These components, which have been ported to
a number of UNIX platforms, can be used in any application with
minimal programming effort. The fault tolerance capabilities of several
communication products and services in AT&T have been enhanced by
incorporating these components. Experience with these products to date
indicates that the components provide efficient, economical means to
increase the level of fault tolerance in an application.