Hierarchical error detection in a software implemented fault tolerance (SIFT) environment

Citation
S. Bagchi et al., Hierarchical error detection in a software implemented fault tolerance (SIFT) environment, IEEE KNOWL, 12(2), 2000, pp. 203-224
Number of citations
28
Subject Categories
AI Robotics and Automatic Control
Journal title
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Journal ISSN
1041-4347
Volume
12
Issue
2
Year of publication
2000
Pages
203 - 224
Database
ISI
SICI code
1041-4347(200003/04)12:2<203:HEDIAS>2.0.ZU;2-K
Abstract
This paper proposes a hierarchical error detection framework for a Software Implemented Fault Tolerance (SIFT) layer of a distributed system. A four-level error detection hierarchy is proposed in the context of Chameleon, a software environment for providing adaptive fault-tolerance in an environment of commercial off-the-shelf (COTS) system components and software. The design and implementation of a software-based distributed signature monitoring scheme, which is central to the proposed four-level hierarchy, is described. Both intralevel and interlevel optimizations that minimize the overhead of detection and are capable of adapting to runtime requirements are proposed. The paper presents results from a prototype implementation of two levels of the error detection hierarchy and results of a detailed simulation of the overall environment. The results indicate a substantial increase in availability due to the detection framework and help in understanding the trade-offs between overhead and coverage for different combinations of techniques.