G. Muller et al., LESSONS FROM FTM - AN EXPERIMENT IN THE DESIGN AND IMPLEMENTATION OF A LOW-COST FAULT-TOLERANT SYSTEM, IEEE Transactions on Reliability, 45(2), 1996, pp. 332-340
This paper describes an experiment in the design of a general-purpose
fault-tolerant system, FTM. The main objective of the FTM design was to
implement a low-cost fault-tolerant system that could be used on
standard workstations. At the operating system level, our goal was to
offer fault-tolerance transparency to user applications. In other words,
porting an application to FTM requires only compiling the source code,
without having to modify it. These objectives were achieved using
the Mach micro-kernel and a modular set of reliable servers which
implement application checkpoints and provide continuous system functions
despite machine crashes. At the architectural level, our approach relies
on a high-performance stable storage implementation, called Stable
Transactional Memory (STM), which can be implemented either in hardware
or in software. We first motivate our design choices, then we detail the
FTM implementation at both the architectural and operating system levels.
We discuss the reasons for the evolution of our stable memory technology
from hardware to software. We evaluate the performance of the FTM
prototype. We conclude with lessons learned and give some assessments.