Selective checkpointing and rollbacks in multi-threaded object-oriented environment

Citation
M. Kasbekar et al., Selective checkpointing and rollbacks in multi-threaded object-oriented environment, IEEE RELIAB, 48(4), 1999, pp. 325-337
Number of citations
24
Subject categories
Electrical & Electronics Engineering
Journal title
IEEE TRANSACTIONS ON RELIABILITY
Journal ISSN
0018-9529
Volume
48
Issue
4
Year of publication
1999
Pages
325 - 337
Database
ISI
SICI code
0018-9529(199912)48:4<325:SCARIM>2.0.ZU;2-F
Abstract
This paper presents selective checkpointing and rollback schemes for MT-OO (multithreaded, object-oriented) programs. There is a need for checkpointing mechanisms that are more sophisticated than the traditional process-level checkpointing. The program model, theoretical foundations, and an implementation of the selective checkpointing & rollback schemes are described. The usefulness of the schemes is demonstrated by implementing a higher level fault-tolerance scheme of conversations using them. The performance implications are studied on a prototype internet e-commerce server. The use of the selective schemes in the prototype server showed an appreciable reduction in the loss of work in the presence of faults. Benefits are more pronounced for a larger level of concurrency in the server. The selective scheme usually outperforms the hypothetical zero-cost global scheme in the presence of faults, vis-a-vis completion times. The experiments also show the vast difference between the sizes of selective checkpoints and global checkpoints. The concurrent sessions scheme (based on the concept of relaxed conversations) required 160 checkpoints in less than an hour. Traditionally, such a scheme would be considered outrageous, but the selective schemes still improve performance in the presence of faults.

The main contribution of this paper is that it brings forward an OO (object-oriented) approach to checkpointing. Not only does the program model separate program state from process state, but it allows one to identify the state associated with each individual thread of the MT program. The prototype showed that this abstract knowledge about the program state can be made available at runtime in the form of suitable data structures. The availability of this information at runtime fuels the design of selective schemes.