CHECKPOINTING DISTRIBUTED SHARED-MEMORY

Authors
Citation
Lm. Silva et Jg. Silva, CHECKPOINTING DISTRIBUTED SHARED-MEMORY, Journal of supercomputing, 11(2), 1997, pp. 137-158
Citations number
32
Categorie Soggetti
Computer Sciences","Engineering, Eletrical & Electronic","Computer Science Hardware & Architecture","Computer Science Theory & Methods
Journal title
ISSN journal
09208542
Volume
11
Issue
2
Year of publication
1997
Pages
137 - 158
Database
ISI
SICI code
0920-8542(1997)11:2<137:CDS>2.0.ZU;2-J
Abstract
Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates only very few DSM environments have some kind of support for fa ult-tolerance. In this article, we present a checkpointing mechanism f or a DSM system that is efficient and portable. It offers some portabi lity because it is built on top of MPI and uses only the services offe red by MPI and a POSIX compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm we present experimental results obtained in a cluster of workstations. We hope that our research show s that efficient, transparent and portable checkpointing is viable for DSM systems.