Design and analysis of an integrated checkpointing and recovery scheme for distributed applications

Citation
B. Ramamurthy et al., Design and analysis of an integrated checkpointing and recovery scheme for distributed applications, IEEE KNOWL, 12(2), 2000, pp. 174-186
Number of citations
24
Subject Categories
AI Robotics and Automatic Control
Journal title
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Journal ISSN
1041-4347
Volume
12
Issue
2
Year of publication
2000
Pages
174 - 186
Database
ISI
SICI code
1041-4347(200003/04)12:2<174:DAAOAI>2.0.ZU;2-U
Abstract
An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multistep rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms, and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.