ENSURING CORRECT ROLLBACK RECOVERY IN DISTRIBUTED SHARED-MEMORY SYSTEMS

Citation
B. Janssens and W. K. Fuchs, ENSURING CORRECT ROLLBACK RECOVERY IN DISTRIBUTED SHARED-MEMORY SYSTEMS, Journal of Parallel and Distributed Computing, 29(2), 1995, pp. 211-218
Number of citations
24
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Journal ISSN
0743-7315
Volume
29
Issue
2
Year of publication
1995
Pages
211 - 218
Database
ISI
SICI code
0743-7315(1995)29:2<211:ECRRID>2.0.ZU;2-5
Abstract
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced. (C) 1995 Academic Press, Inc.
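
Illustrative note (not part of the original record). The abstract's idea that rollback dependencies arise only from page transfers can be pictured with a generic rollback-propagation sketch. This is not the algorithm from the paper: the names PageTransfer and recovery_line, the interval numbering, and the example data are all hypothetical, and in-transit messages (which the paper recovers without logging) are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageTransfer:
    """One page transfer between two nodes (hypothetical record layout)."""
    sender: str
    send_interval: int   # sender's checkpoint interval when the page left
    receiver: str
    recv_interval: int   # receiver's checkpoint interval when it arrived


def recovery_line(current_interval, transfers, failed):
    """Return, per node, the checkpoint index it would restart from.

    Assumption: interval k of a node starts at its checkpoint k.
    A value of current_interval + 1 means the node keeps its present
    state and does not roll back at all.
    """
    # Start optimistic: nobody rolls back except the failed node,
    # which restarts from its latest checkpoint.
    line = {node: k + 1 for node, k in current_interval.items()}
    line[failed] = current_interval[failed]

    changed = True
    while changed:
        changed = False
        for t in transfers:
            send_undone = t.send_interval >= line[t.sender]
            recv_kept = t.recv_interval < line[t.receiver]
            if send_undone and recv_kept:
                # The receiver saw a page whose sending is being undone,
                # so it must undo the interval in which it received it.
                line[t.receiver] = t.recv_interval
                changed = True
    return line


# Hypothetical three-node example: A fails during its interval 2.
intervals = {"A": 2, "B": 1, "C": 1}
transfers = [
    PageTransfer("A", 2, "B", 1),  # sent in A's lost interval: B depends on it
    PageTransfer("B", 0, "C", 1),  # sent before B's checkpoint 1: survives
]
result = recovery_line(intervals, transfers, failed="A")
print(result)
```

In this example B must restart from its checkpoint 1 because it received a page that A sent during the interval A lost, while C keeps its current state because the page it received from B was sent before B's restart point.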