Application-level fault tolerance as a complement to system-level fault tolerance

Citation
J. Haines et al., Application-level fault tolerance as a complement to system-level fault tolerance, J SUPERCOMP, 16(1), 2000, pp. 53-68
Number of citations
10
Subject categories
Computer Science & Engineering
Journal title
JOURNAL OF SUPERCOMPUTING
Journal ISSN
0920-8542
Volume
16
Issue
1
Year of publication
2000
Pages
53 - 68
Database
ISI
SICI code
0920-8542(200005)16:1<53:AFTAAC>2.0.ZU;2-Q
Abstract
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software, whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.