COMPONENTS FOR SOFTWARE FAULT-TOLERANCE AND REJUVENATION

Citation
Yn. Huang et al., COMPONENTS FOR SOFTWARE FAULT-TOLERANCE AND REJUVENATION, AT&T technical journal, 75(2), 1996, pp. 29-37
Number of citations
11
Subject Categories
Computer Science Hardware & Architecture, Telecommunications
Journal title
AT&T technical journal
Journal ISSN
8756-2324
Volume
75
Issue
2
Year of publication
1996
Pages
29 - 37
Database
ISI
SICI code
8756-2324(1996)75:2<29:CFSFAR>2.0.ZU;2-A
Abstract
Software fault tolerance is the task of detecting and recovering from failures that are not handled in the underlying hardware or operating system layers of an application. Software rejuvenation prevents failures by periodically, and gracefully, terminating an application and restarting it at a clean internal state. This paper describes five reusable software components that provide these capabilities. They perform automatic detection and restart of failed processes, checkpointing and recovery of data in memory, replication and synchronization of files, and software rejuvenation. These components, which have been ported to a number of UNIX platforms, can be used in any application with minimal programming effort. The fault tolerance capabilities of several communication products and services in AT&T have been enhanced by incorporating these components. Experience with these products to date indicates that the components provide efficient, economical means to increase the level of fault tolerance in an application.
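
The two ideas in the abstract that are easiest to picture in code are restart of a failed process and rejuvenation. The following is a minimal sketch only, not the components described in the paper: the function name `supervise` and its parameters are hypothetical, failure detection is reduced to noticing that the child process has exited, and checkpointing and file replication are not shown.

```python
import subprocess
import sys


def supervise(cmd, rejuvenate_after, max_starts):
    """Start `cmd` up to `max_starts` times and record what ended each run.

    Hypothetical illustration, not the paper's API:
    - if the child exits by itself, it is treated as failed and restarted;
    - if it is still running after `rejuvenate_after` seconds, it is asked
      to terminate and is restarted from a clean state (rejuvenation).
    """
    events = []
    for _ in range(max_starts):
        proc = subprocess.Popen(cmd)
        try:
            # Block until the child exits or the rejuvenation interval elapses.
            code = proc.wait(timeout=rejuvenate_after)
            events.append(("exited", code))
        except subprocess.TimeoutExpired:
            # Request termination (SIGTERM on UNIX), then reap the child.
            proc.terminate()
            proc.wait()
            events.append(("rejuvenated", None))
    return events


if __name__ == "__main__":
    # A child that fails immediately with exit status 3 is started twice.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(failing, rejuvenate_after=10.0, max_starts=2))
```

In this sketch a restart happens whatever the exit status is, and a hung process is only replaced when the rejuvenation interval expires. A real watchdog would also need a liveness check such as a heartbeat, which is left out here.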