HARNESS and fault tolerant MPI

Citation
G.E. Fagg et al., HARNESS and fault tolerant MPI, PARALLEL COMPUTING, 27(11), 2001, pp. 1479-1495
Number of citations
22
Subject categories
Computer Science & Engineering
Journal title
PARALLEL COMPUTING
Journal ISSN
0167-8191
Volume
27
Issue
11
Year of publication
2001
Pages
1479 - 1495
Database
ISI
SICI code
0167-8191(200110)27:11<1479:HAFTM>2.0.ZU;2-P
Abstract
Initial versions of MPI were designed to work efficiently on multi-processors which had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model would have affected their performance. As current HPC systems increase in size with greater potential levels of individual node failure, the need arises for new fault tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. Given is an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues. Also discussed is the experimental HARNESS core (G_HCORE) implementation that FT-MPI is built to operate upon. (C) 2001 Elsevier Science B.V. All rights reserved.