R. Baldoni et al., ADAPTIVE CHECKPOINTING IN MESSAGE-PASSING DISTRIBUTED SYSTEMS, International Journal of Systems Science, 28(11), 1997, pp. 1145-1161
Determining consistent global checkpoints is common to many distribute
d problems such as fault-tolerance, distributed debugging, properties
detection, etc. Uncoordinated and coordinated checkpointing algorithms
have been traditionally used for such determinations. This paper addr
ess a third technique, namely adaptive checkpointing, that has recentl
y emerged. This technique assumes processes take local checkpoints ind
ependently and requires them to take additional local checkpoints in o
rder that all local checkpoints be members of some consistent global c
heckpoint. We first study the characteristics of such adaptive algorit
hms. Then, a general adaptive checkpointing algorithm is designed from
a condition, first stated by Netzer and Xu, that answers the followin
g question: 'does a given local checkpoint belong to a consistent glob
al checkpoint?' (such a local checkpoint is not useless). The resultin
g algorithm has the nice property to reduce the number of additional l
ocal checkpoints taken to ensure the property 'no local checkpoint is
useless'. Furthermore, it provides each local checkpoint with a consis
tent global checkpoint to which it belongs. Compared to uncoordinated
and coordinated checkpointing algorithms, this algorithm combines the
advantages of both without inheriting their drawbacks.