B. Janssens and W. K. Fuchs, ENSURING CORRECT ROLLBACK RECOVERY IN DISTRIBUTED SHARED-MEMORY SYSTEMS, Journal of Parallel and Distributed Computing, 29(2), 1995, pp. 211-218
Number of citations
24
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Distributed shared memory (DSM) implemented on a cluster of
workstations is an increasingly attractive platform for executing
parallel scientific applications. Checkpointing and rollback
techniques can be used in such a system to allow the computation to
progress in spite of the temporary failure of one or more processing
nodes. This paper presents the design of an independent checkpointing
method for DSM that takes advantage of DSM's specific properties to
reduce error-free and rollback overhead. The scheme reduces the
dependencies that need to be considered for correct rollback to those
resulting from transfers of pages. Furthermore, in-transit messages
can be recovered without the use of logging. We extend the scheme to a
DSM implementation using lazy release consistency, where the frequency
of dependencies is further reduced. (C) 1995 Academic Press, Inc.
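
Illustrative note: the abstract says that only page transfers create dependencies that matter for rollback. The sketch below shows one generic way such dependencies could be used to compute which nodes must roll back after a failure. It is not the authors' protocol, whose details the abstract does not give. It is the standard rollback-propagation idea, applied under the assumption that each page transfer is recorded as (sender, sender interval, receiver, receiver interval), where interval k is the execution that follows checkpoint k. The function name and data layout are hypothetical.

```python
def rollback_line(latest, transfers, failed):
    """Return {node: checkpoint index to restart from} for nodes that roll back.

    latest:    {node: index of that node's most recent checkpoint}
    transfers: list of (sender, send_interval, receiver, recv_interval)
               tuples, one per recorded page transfer (hypothetical layout)
    failed:    set of nodes that lost their volatile state
    """
    # undone_from[n] is the first interval of n whose work is discarded.
    # A failed node loses its current interval (index latest[n]);
    # a surviving node initially loses nothing (latest[n] + 1).
    undone_from = {n: (k if n in failed else k + 1) for n, k in latest.items()}

    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for sender, s_int, receiver, r_int in transfers:
            # A page sent during a discarded interval invalidates the
            # receiving interval, and every later interval on the receiver.
            if s_int >= undone_from[sender] and r_int < undone_from[receiver]:
                undone_from[receiver] = r_int
                changed = True

    # Report only the nodes that actually have to roll back.
    return {n: k for n, k in undone_from.items() if k <= latest[n]}


if __name__ == "__main__":
    latest = {"A": 2, "B": 1, "C": 3}
    transfers = [("A", 2, "B", 1), ("B", 1, "C", 3), ("C", 0, "A", 1)]
    print(rollback_line(latest, transfers, failed={"A"}))
```

In this toy run, the failure of A discards a page that B received in its current interval, and B in turn passed a page to C, so all three nodes restart from their latest checkpoints. If C had failed instead, only C would roll back, since no page sent in its discarded interval reached another node.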