EFFECTIVE AND CONCURRENT CHECKPOINTING AND RECOVERY IN DISTRIBUTED SYSTEMS

Authors

HOU CJ; TSOI KS; HAN CC

Citation

Cj. Hou et al., EFFECTIVE AND CONCURRENT CHECKPOINTING AND RECOVERY IN DISTRIBUTED SYSTEMS, IEE proceedings. Computers and digital techniques, 144(5), 1997, pp. 304-316

Journal title

IEE proceedings. Computers and digital techniques

ISSN journal

1350-2387

Volume

144

Issue

5

Year of publication

1997

Pages

304 - 316

Database

ISI

SICI code

1350-2387(1997)144:5<304:EACCAR>2.0.ZU;2-7

Abstract

The paper presents an effective, application-transparent checkpointing/rollback scheme for multiple processes that communicate via message passing in a distributed system. The authors first propose a checkpointing scheme that uses the unforced checkpointing strategy and dynamically varies checkpoint intervals with respect to the frequency of message sending to reduce process rollback propagation. Additional forced checkpoints are taken only to achieve checkpoint consistency among processes and to avoid the domino effect. The authors then discuss both global rollback and minimal rollback approaches, and incorporate them into the proposed checkpointing scheme. The combined checkpointing/rollback scheme can handle out-of-order messages, achieve high concurrency during checkpointing/rollback operations, and allow multiple invocations of checkpointing/rollback instances. To reduce the space overhead, a global recovery line determination approach is proposed to purge the checkpoints to which processes will never roll back. Experience with event-driven simulation indicates that the proposed scheme can effectively reduce rollback propagation, while incurring little control message overhead and maintaining at any time only a few checkpoints at each process.
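
Illustrative note: the abstract does not give the authors' algorithms, so the following is only a minimal sketch of the generic, textbook idea behind a recovery line, not the scheme proposed in the paper. It assumes a hypothetical model in which each process numbers its checkpoints 0, 1, 2, ... and each message is recorded as (sender, send_interval, receiver, recv_interval), where interval i means "after checkpoint i and before checkpoint i+1". A message is an orphan if the receiver's checkpoint records its receipt while the sender's checkpoint does not record its sending; the receiver is then rolled back further, which may propagate (the domino effect the abstract mentions). The function names recovery_line and purgeable are invented for this sketch.

```python
def recovery_line(latest, messages):
    """Return a consistent set of checkpoint indices, one per process.

    latest:   dict mapping process name -> index of its latest checkpoint
    messages: iterable of (sender, send_interval, receiver, recv_interval)
              (hypothetical record format assumed for this sketch)
    """
    line = dict(latest)
    changed = True
    while changed:  # repeat until no orphan message remains
        changed = False
        for sender, s_int, receiver, r_int in messages:
            received_in_state = r_int < line[receiver]
            sent_in_state = s_int < line[sender]
            if received_in_state and not sent_in_state:
                # Orphan message: roll the receiver back to the checkpoint
                # taken just before the receive event.
                line[receiver] = r_int
                changed = True
    return line


def purgeable(line):
    """Checkpoints older than the recovery line, which this sketch treats
    as safe to discard (assuming no rollback ever goes past the line)."""
    return {proc: list(range(idx)) for proc, idx in line.items()}


if __name__ == "__main__":
    latest = {"P": 2, "Q": 2}
    messages = [
        ("P", 2, "Q", 1),  # sent after P's checkpoint 2, received before Q's checkpoint 2
        ("Q", 1, "P", 1),  # sent after Q's checkpoint 1, received before P's checkpoint 2
    ]
    line = recovery_line(latest, messages)
    print(line)
    print(purgeable(line))
```

In this example the first message forces Q back to checkpoint 1, which in turn makes the second message an orphan and forces P back to checkpoint 1, a small instance of rollback propagation. Taking forced checkpoints and adapting checkpoint intervals to message frequency, as the abstract describes, are ways of keeping such chains short.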