Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Citation
Js. Ouyang and P. Maheshwari, Supporting cost-effective fault tolerance in distributed message-passing applications with file operations, J SUPERCOMP, 14(3), 1999, pp. 207-232
Number of citations
42
Subject Categories
Computer Science & Engineering
Journal title
JOURNAL OF SUPERCOMPUTING
Journal ISSN
0920-8542
Volume
14
Issue
3
Year of publication
1999
Pages
207 - 232
Database
ISI
SICI code
0920-8542(1999)14:3<207:SCFTID>2.0.ZU;2-K
Abstract
In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy to use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
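The third feature in the abstract, checkpointing and recovery of user files, can be illustrated with a minimal sketch. It is not Libra's interface or algorithm, which this record does not describe. It assumes a simpler, commonly used scheme for append-only files: save the file length at each checkpoint and truncate back to it on rollback. The class name `CheckpointedAppendFile` and its methods are hypothetical.

```python
import os


class CheckpointedAppendFile:
    """Hypothetical helper (not part of Libra): an append-only file
    whose length is recorded at each checkpoint."""

    def __init__(self, path):
        self.path = path
        # Append mode positions the file at its current end.
        self._f = open(path, "ab")
        self._checkpoint_len = self._f.tell()

    def write(self, data):
        self._f.write(data)

    def checkpoint(self):
        # Force buffered data to stable storage, then record the length.
        self._f.flush()
        os.fsync(self._f.fileno())
        self._checkpoint_len = self._f.tell()
        return self._checkpoint_len

    def rollback(self):
        # Discard everything appended since the last checkpoint.
        self._f.flush()
        self._f.truncate(self._checkpoint_len)
        self._f.seek(0, os.SEEK_END)

    def close(self):
        self._f.close()
```

In this sketch a process would call `checkpoint()` when it saves its volatile state and `rollback()` when it restarts from that saved state, so the file contents stay consistent with the restored computation. Files that are overwritten in place would need a different approach, such as logging the overwritten bytes.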