Distributed shared memory (DSM) is a very promising programming model
for exploiting the parallelism of distributed memory systems, because
it provides a higher level of abstraction than simple message passing.
Although the nodes of standard distributed systems exhibit high crash
rates only very few DSM environments have some kind of support for fa
ult-tolerance. In this article, we present a checkpointing mechanism f
or a DSM system that is efficient and portable. It offers some portabi
lity because it is built on top of MPI and uses only the services offe
red by MPI and a POSIX compliant local file system. As far as we know,
this is the first real implementation of such a scheme for DSM. Along
with the description of the algorithm we present experimental results
obtained in a cluster of workstations. We hope that our research show
s that efficient, transparent and portable checkpointing is viable for
DSM systems.