UNITS OF COMPUTATION IN FAULT-TOLERANT DISTRIBUTED SYSTEMS

Authors
Citation
M. Ahuja and S. Mishra, UNITS OF COMPUTATION IN FAULT-TOLERANT DISTRIBUTED SYSTEMS, Journal of Parallel and Distributed Computing, 40(2), 1997, pp. 194-209
Number of citations
30
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Journal ISSN
0743-7315
Volume
40
Issue
2
Year of publication
1997
Pages
194 - 209
Database
ISI
SICI code
0743-7315(1997)40:2<194:UOCIFD>2.0.ZU;2-6
Abstract
We develop a framework that helps in understanding a fault-tolerant distributed system and so aids in designing such systems. We illustrate the uses of the framework in application areas such as checkpointing and recovery, phase termination detection, stable property detection, implementing membership protocols, debugging, and design of programming languages. We define a unit of computation and refer to it as a molecule. A molecule has a well-defined interface with other molecules. The smallest such unit, an indivisible molecule, is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules/atoms in a partial order. Such a view provides insights into understanding the computation, particularly for a fault-tolerant system, where it is important to guarantee that a unit of computation is either completely executed or not executed at all, and where system designers need to reason about the states after execution of such units. Molecules are essentially a generalization of atomic actions. (C) 1997 Academic Press.