HARNESS and fault tolerant MPI

Authors

Fagg, GE; Bukovsky, A; Dongarra, JJ

Citation

G.E. Fagg et al., HARNESS and fault tolerant MPI, Parallel Computing, 27(11), 2001, pp. 1479-1495

Citations number

Subject Categories

Computer Science & Engineering

Journal title

PARALLEL COMPUTING

ISSN journal

0167-8191

Volume

27

Issue

11

Year of publication

2001

Pages

1479 - 1495

Database

ISI

SICI code

0167-8191(200110)27:11<1479:HAFTM>2.0.ZU;2-P

Abstract

Initial versions of MPI were designed to work efficiently on multi-processors which had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model would have affected their performance. As current HPC systems increase in size with greater potential levels of individual node failure, the need arises for new fault tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. Given is an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues. Also discussed is the experimental HARNESS core (G_HCORE) implementation that FT-MPI is built to operate upon. (C) 2001 Elsevier Science B.V. All rights reserved.