This paper proposes a hierarchical error detection framework for a Software
Implemented Fault Tolerance (SIFT) layer of a distributed system. A
four-level error detection hierarchy is proposed in the context of
Chameleon, a software environment for providing adaptive fault-tolerance in
an environment of commercial off-the-shelf (COTS) system components and
software. The design and implementation of a software-based distributed
signature monitoring scheme, which is central to the proposed four-level
hierarchy, are described. Both intralevel and interlevel optimizations that
minimize the overhead
of detection and are capable of adapting to runtime requirements are
proposed. The paper presents results from a prototype implementation of two
levels of the error detection hierarchy and results of a detailed simulation of
the overall environment. The results indicate a substantial increase in
availability due to the detection framework and help in understanding the
trade-offs between overhead and coverage for different combinations of
techniques.