EFFECTIVE AND CONCURRENT CHECKPOINTING AND RECOVERY IN DISTRIBUTED SYSTEMS

Citation
Cj. Hou et al., EFFECTIVE AND CONCURRENT CHECKPOINTING AND RECOVERY IN DISTRIBUTED SYSTEMS, IEE proceedings. Computers and digital techniques, 144(5), 1997, pp. 304-316
Citations number
29
ISSN journal
13502387
Volume
144
Issue
5
Year of publication
1997
Pages
304 - 316
Database
ISI
SICI code
1350-2387(1997)144:5<304:EACCAR>2.0.ZU;2-7
Abstract
The paper presents an effective application-transparent checkpointing/rollback scheme for multiple processes that communicate via message passing in a distributed system. The authors first propose a checkpointing scheme that uses the unforced checkpointing strategy and dynamically varies checkpoint intervals with respect to the frequency of message sending to reduce process rollback propagation. Additional forced checkpoints are taken only to achieve checkpoint consistency among processes and to avoid the domino effect. The authors then discuss both global rollback and minimal rollback approaches, and incorporate them into the proposed checkpointing scheme. The combined checkpointing/rollback scheme can handle out-of-order messages, achieve high concurrency during checkpointing/rollback operations, and allow multiple invocations of checkpointing/rollback instances. To reduce the space overhead, a global recovery line determination approach to purge the checkpoints to which processes shall never roll back is proposed. Experiences with event-driven simulation indicate that the proposed scheme can effectively reduce rollback propagation, while incurring little control message overhead and maintaining at any time only a few checkpoints at each process.
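
Illustrative sketch
The abstract does not give the authors' algorithms, so the following is not their method. It is a minimal textbook-style sketch of the two general ideas the abstract names: finding a recovery line (a consistent set of checkpoints with no orphan messages) by rollback propagation, and purging checkpoints older than that line. The function names `recovery_line` and `purge`, the event-index model, and the assumption that every process has an initial checkpoint at index 0 are all hypothetical choices made for illustration.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages):
    """Return the latest consistent checkpoint per process (assumed model).

    checkpoints: {process: sorted list of local event indices, starting at 0}
    messages:    list of (sender, send_idx, receiver, recv_idx), indices >= 1
    A checkpoint at index c captures all local events with index <= c.
    """
    # Start from each process's most recent checkpoint.
    line = {p: cps[-1] for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_idx, receiver, recv_idx in messages:
            # Orphan message: the receive is recorded but the send is not.
            if send_idx > line[sender] and recv_idx <= line[receiver]:
                cps = checkpoints[receiver]
                # Roll the receiver back to its latest checkpoint taken
                # strictly before the receive event.
                line[receiver] = cps[bisect_left(cps, recv_idx) - 1]
                changed = True
    return line


def purge(checkpoints, line):
    """Drop checkpoints older than the recovery line (never rolled back to)."""
    return {p: [c for c in cps if c >= line[p]] for p, cps in checkpoints.items()}


if __name__ == "__main__":
    cps = {"P": [0, 5], "Q": [0, 4, 8]}
    # P sends at local event 6 (after its checkpoint 5);
    # Q receives at local event 7 (before its checkpoint 8).
    msgs = [("P", 6, "Q", 7)]
    line = recovery_line(cps, msgs)
    print(line)
    print(purge(cps, line))
```

In this toy run, Q's checkpoint at index 8 records a message that P's checkpoint at index 5 has not yet sent, so Q is rolled back to index 4. Because each step only moves a process to an earlier checkpoint and index 0 always exists, the loop terminates. A chain of such rollbacks reaching index 0 on every process is the domino effect that forced checkpoints are meant to prevent.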