Cj. Hou et al., EFFECTIVE AND CONCURRENT CHECKPOINTING AND RECOVERY IN DISTRIBUTED SYSTEMS, IEE Proceedings - Computers and Digital Techniques, 144(5), 1997, pp. 304-316
The paper presents an effective, application-transparent
checkpointing/rollback scheme for multiple processes that communicate
via message passing in a distributed system. The authors first propose
a checkpointing scheme that uses the unforced checkpointing strategy
and dynamically varies checkpoint intervals with respect to the
frequency of message sending, in order to reduce process rollback
propagation. Additional forced checkpoints are taken only to achieve
checkpoint consistency among processes and to avoid the domino effect.
The authors then discuss both global rollback and minimal rollback
approaches and incorporate them into the proposed checkpointing scheme.
The combined checkpointing/rollback scheme can handle out-of-order
messages, achieve high concurrency during checkpointing/rollback
operations, and allow multiple invocations of checkpointing/rollback
instances. To reduce the space overhead, the authors propose a global
recovery line determination approach that purges the checkpoints to
which processes will never roll back. Experience with event-driven
simulation indicates that the proposed scheme can effectively reduce
rollback propagation while incurring little control message overhead
and maintaining only a few checkpoints at each process at any time.
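
The abstract does not say how the global recovery line is determined, so the following is only a minimal sketch of the general idea, using the textbook rollback-propagation fixpoint rather than the authors' algorithm. All names (`Message`, `recovery_line`, `purge`) and the interval numbering are assumptions made for illustration: checkpoint k of a process opens interval k, and restoring a process to checkpoint c undoes every event in intervals c and later. A message is an orphan when its receive is kept but its send is undone, and the receiver must then roll back further. In-transit messages (send kept, receive undone) are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: int
    send_interval: int  # sender's latest checkpoint index at send time
    receiver: int
    recv_interval: int  # receiver's latest checkpoint index at receive time


def recovery_line(latest, messages):
    """Return, per process, the newest checkpoint index that belongs to a
    consistent global state (no orphan messages), starting from `latest`."""
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_undone = m.send_interval >= line[m.sender]
            recv_kept = m.recv_interval < line[m.receiver]
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back far enough
                # that the receive event is undone as well.
                line[m.receiver] = m.recv_interval
                changed = True
    return line


def purge(checkpoints, line):
    """Drop checkpoints older than the recovery line; no process can ever
    be asked to roll back to them."""
    return {p: [c for c in cps if c >= line[p]]
            for p, cps in checkpoints.items()}


if __name__ == "__main__":
    msgs = [
        Message(sender=0, send_interval=2, receiver=1, recv_interval=1),
        Message(sender=1, send_interval=1, receiver=2, recv_interval=0),
    ]
    line = recovery_line({0: 2, 1: 2, 2: 1}, msgs)
    print(line)
    print(purge({0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1]}, line))
```

In this example the first message forces process 1 back from checkpoint 2 to checkpoint 1, which in turn turns the second message into an orphan and forces process 2 back to checkpoint 0. This cascade is the rollback propagation that the forced checkpoints in the abstract are meant to limit. The loop terminates because indices only decrease and are bounded below by 0.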