Initial versions of MPI were designed to work efficiently on
multi-processors which had very little job control and thus static process
models. Subsequently forcing them to support a dynamic process model would
have affected
their performance. As current HPC systems increase in size, with a greater
potential for individual node failures, the need arises for new
fault-tolerant systems to be developed. Here we present a new
implementation of MPI
called fault tolerant MPI (FT-MPI), which allows the semantics and
associated failure modes to be explicitly controlled by an application via
a modified MPI API. We give an overview of the FT-MPI semantics, design,
example applications, debugging tools and some performance issues. We also
discuss the experimental HARNESS core (G_HCORE) implementation on which
FT-MPI is built to operate. (C) 2001 Elsevier Science B.V. All rights
reserved.