Long-running applications are often subject to failures. Failures can
result in significant loss of computation. Therefore, it is necessary
to use a failure recovery scheme to minimize performance overhead in t
he presence of failures. In this paper, we argue that it is often adva
ntageous to use ''two-level'' recovery schemes. A two-level recovery s
cheme tolerates the more probable failures with low performance overhe
ad, while the less probable failures may possibly incur a higher overh
ead. By minimizing overhead for the more frequently occurring failure
scenarios, the two-level approach can achieve lower performance overhe
ad (on average) as compared to existing recovery schemes. The paper de
scribes two two-level recovery schemes. Performance analysis using a M
arkov chain shows that, in practice, a two-level scheme can perform be
tter than its ''one-level'' counterpart. While the conclusions of this
paper are intuitive, the work on design of appropriate recovery schem
es is lacking. The objective of this paper is to motivate research int
o recovery schemes that can provide multiple levels of fault tolerance
and achieve better performance than existing recovery schemes. The pa
per presents an analytical approach for evaluating performance of two-
level schemes and shows that such schemes are hard to optimize analyti
cally.