ADAPTIVE CHECKPOINTING IN MESSAGE-PASSING DISTRIBUTED SYSTEMS

Citation
R. Baldoni et al., ADAPTIVE CHECKPOINTING IN MESSAGE-PASSING DISTRIBUTED SYSTEMS, International Journal of Systems Science, 28(11), 1997, pp. 1145-1161
Citations number
21
ISSN journal
00207721
Volume
28
Issue
11
Year of publication
1997
Pages
1145 - 1161
Database
ISI
SICI code
0020-7721(1997)28:11<1145:ACIMDS>2.0.ZU;2-Z
Abstract
Determining consistent global checkpoints is common to many distribute d problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addr ess a third technique, namely adaptive checkpointing, that has recentl y emerged. This technique assumes processes take local checkpoints ind ependently and requires them to take additional local checkpoints in o rder that all local checkpoints be members of some consistent global c heckpoint. We first study the characteristics of such adaptive algorit hms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the followin g question: 'does a given local checkpoint belong to a consistent glob al checkpoint?' (such a local checkpoint is not useless). The resultin g algorithm has the nice property to reduce the number of additional l ocal checkpoints taken to ensure the property 'no local checkpoint is useless'. Furthermore, it provides each local checkpoint with a consis tent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks.