A survey of checkpointing algorithms for parallel and distributed computers

Authors

Kalaiselvi, S.; Rajaraman, V.

Citation

S. Kalaiselvi and V. Rajaraman, A survey of checkpointing algorithms for parallel and distributed computers, SADHANA, 25, 2000, pp. 489-510

Subject Categories

Engineering Management / General

Journal title

SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES

ISSN journal

0256-2499

Volume

25

Year of publication

2000

Pages

489 - 510

Database

ISI

SICI code

0256-2499(200010)25:<489:ASOCAF>2.0.ZU;2-G

Abstract

A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in that article and by extending it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They, however, also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is, however, felt that in the development of parallel programs the user has to do a fair amount of work in distributing tasks, and this information can be effectively used to simplify checkpointing and rollback recovery.
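
As an illustration of the definition of a checkpoint given in the abstract, the following is a minimal single-process sketch in Python. It is not taken from the paper and does not implement any of the surveyed parallel or distributed algorithms (in particular, it is not the Chandy and Lamport algorithm). The names `save_checkpoint`, `load_checkpoint`, `run_sum`, and the `fail_at` parameter are hypothetical, and an ordinary local file stands in for stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Save the status information to `path` (hypothetical helper).

    The state is written to a temporary file in the same directory and then
    renamed over the target, so a crash mid-write leaves the old checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_sum(n, path, fail_at=None):
    """Sum 0..n-1, taking a checkpoint after every step.

    `fail_at` is an assumption of this sketch: it simulates a failure
    just before the given step is processed.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(state, path)  # the checkpoint: status saved for later resumption
    return state["total"]
```

Calling `run_sum` again with the same `path` after a failure resumes from the last saved step instead of restarting from zero, which is rollback recovery in its simplest form.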