This paper presents selective checkpointing and rollback schemes for MT-OO
(multithreaded, object-oriented) programs. Such programs need checkpointing
mechanisms that are more sophisticated than traditional process-level
checkpointing. The program model, theoretical foundations, and an
implementation of the selective checkpointing and rollback schemes are
described. The usefulness of the schemes is demonstrated by using them to
implement conversations, a higher-level fault-tolerance scheme. The
performance implications are studied on a prototype Internet e-commerce
server. Using the selective schemes in the prototype server appreciably
reduced the loss of work in the presence of faults, and the benefits are
more pronounced at higher levels of concurrency in the server. With respect
to completion times, the selective scheme usually outperforms a hypothetical
zero-cost global scheme in the presence of faults. The experiments also show
the vast difference between the sizes of selective checkpoints and global
checkpoints. The concurrent sessions scheme (based on the concept of relaxed
conversations) required 160 checkpoints in less than an hour. Traditionally,
such a scheme would be considered outrageous, but the selective schemes
still improve performance in the presence of faults.
The main contribution of this paper is an object-oriented (OO) approach to
checkpointing. The program model not only separates program state from
process state, but also makes it possible to identify the state associated
with each individual thread of the MT program. The prototype showed that
this abstract knowledge about the program state can be made available at
runtime in the form of suitable data structures. The availability of this
information at runtime drives the design of selective schemes.
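
To make the idea of per-thread program state concrete, the following is a
minimal sketch, not the implementation described in this paper. It assumes a
runtime table that maps each thread to the objects it works on, so that a
checkpoint or rollback touches only that thread's objects. The names
SelectiveCheckpointer, Account, register, checkpoint, rollback, and demo are
hypothetical, and plain strings stand in for thread identifiers.

```python
import copy
from collections import defaultdict


class Account:
    """Hypothetical application object whose state can be saved and restored."""

    def __init__(self, balance):
        self.balance = balance


class SelectiveCheckpointer:
    """Illustrative only: tracks the objects each thread uses and saves just those."""

    def __init__(self):
        self._owned = defaultdict(list)  # thread id -> objects registered for it
        self._saved = {}                 # thread id -> list of (object, saved state)

    def register(self, thread_id, obj):
        # Record that this thread's state includes obj (assumed to be known
        # from the program model; here the caller supplies it explicitly).
        self._owned[thread_id].append(obj)

    def checkpoint(self, thread_id):
        # Save only the state of this thread's registered objects.
        self._saved[thread_id] = [
            (obj, copy.deepcopy(obj.__dict__)) for obj in self._owned[thread_id]
        ]
        return len(self._saved[thread_id])

    def rollback(self, thread_id):
        # Restore only this thread's objects; other threads keep their work.
        for obj, state in self._saved[thread_id]:
            obj.__dict__.clear()
            obj.__dict__.update(copy.deepcopy(state))


def demo():
    cp = SelectiveCheckpointer()
    a, b = Account(100), Account(200)
    cp.register("t1", a)
    cp.register("t2", b)
    cp.checkpoint("t1")
    a.balance -= 30    # work done by t1 after its checkpoint
    b.balance += 50    # independent work done by t2
    cp.rollback("t1")  # a fault in t1 undoes only t1's work
    return a.balance, b.balance
```

In this sketch, demo() rolls back the first account to its checkpointed
balance while the second account keeps its update, which is the reduction in
lost work that a global rollback could not offer.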