Chameleon: A software infrastructure for adaptive fault tolerance

Citation
Z.T. Kalbarczyk et al., Chameleon: A software infrastructure for adaptive fault tolerance, IEEE Transactions on Parallel and Distributed Systems, 10(6), 1999, pp. 560-579
Number of citations
34
Subject Categories
Computer Science & Engineering
Journal title
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Journal ISSN
10459219
Volume
10
Issue
6
Year of publication
1999
Pages
560 - 579
Database
ISI
SICI code
1045-9219(199906)10:6<560:CASIFA>2.0.ZU;2-H
Abstract
This paper presents Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be simultaneously supported in a networked environment. Chameleon provides dependability through the use of special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) that control all operations in the Chameleon environment. Three broad classes of ARMORs are defined: 1) Managers oversee other ARMORs and recover from failures in their subordinates. 2) Daemons provide communication gateways to the ARMORs at the host node. They also make available a host's resources to the Chameleon environment. 3) Common ARMORs implement specific techniques for providing application-required dependability. Employing ARMORs, Chameleon makes available different fault-tolerant configurations and maintains run-time adaptation to changes in the availability requirements of an application. Flexible ARMOR architecture allows their composition to be reconfigured at run-time, i.e., the ARMORs may dynamically adapt to changing application requirements. In this paper, we describe ARMOR architecture, including ARMOR class hierarchy, basic building blocks, ARMOR composition, and use of ARMOR factories. We present how ARMORs can be reconfigured and reengineered and demonstrate how the architecture serves our objective of providing an adaptive software infrastructure. To our knowledge, Chameleon is one of the few real implementations which enables multiple fault tolerance strategies to exist in the same environment and supports fault-tolerant execution of substantially off-the-shelf applications via a software infrastructure only. Chameleon provides fault tolerance from the application's point of view as well as from the software infrastructure's point of view. To demonstrate the Chameleon capabilities, we have implemented a prototype infrastructure which provides a set of ARMORs to initialize the environment and to support the dual and TMR application execution modes. Through this testbed environment, we measure the execution overhead and recovery times from failures in the user application, the Chameleon ARMORs, the hardware, and the operating system.
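The three ARMOR classes and the TMR execution mode named in the abstract can be illustrated with a minimal Python sketch. It is not the Chameleon implementation: all class and method names (`Armor`, `Daemon`, `ExecutionArmor`, `Manager`, `run_tmr`) are hypothetical, the daemon is reduced to a registry of the ARMORs on its host, and the manager masks a single faulty replica by plain majority voting, which is the generic TMR technique rather than the paper's specific protocol.

```python
from collections import Counter
from typing import Callable, List, Optional


class Armor:
    """Base class for all ARMORs (hypothetical; the paper's API may differ)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True


class Daemon(Armor):
    """Gateway on a host node; here it only records the ARMORs installed there."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.local_armors: List[Armor] = []

    def install(self, armor: Armor) -> None:
        self.local_armors.append(armor)


class ExecutionArmor(Armor):
    """A common ARMOR that runs one replica of the application."""

    def __init__(self, name: str, app: Callable[[int], int]) -> None:
        super().__init__(name)
        self.app = app

    def run(self, x: int) -> Optional[int]:
        # A failed replica produces no result.
        return self.app(x) if self.alive else None


class Manager(Armor):
    """Oversees subordinate ARMORs and votes on the replica results."""

    def __init__(self, name: str, replicas: List[ExecutionArmor]) -> None:
        super().__init__(name)
        self.replicas = replicas

    def run_tmr(self, x: int) -> int:
        results = [r.run(x) for r in self.replicas]
        votes = Counter(res for res in results if res is not None)
        if not votes:
            raise RuntimeError("all replicas failed")
        value, count = votes.most_common(1)[0]
        if count < 2:
            # With three replicas, a majority needs at least two equal results.
            raise RuntimeError("no majority among replicas")
        return value


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1  # simulates a replica that computes a wrong value


daemon = Daemon("daemon@node1")
replicas = [
    ExecutionArmor("replica0", square),
    ExecutionArmor("replica1", square),
    ExecutionArmor("replica2", faulty_square),
]
for replica in replicas:
    daemon.install(replica)

manager = Manager("manager", replicas)
print(manager.run_tmr(7))  # prints 49: two correct replicas outvote the faulty one
```

In this sketch a second failure leaves no majority, so the manager raises an error instead of returning a possibly wrong value. A real infrastructure would instead recover or replace the failed replica, which is the role the abstract assigns to managers.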