AN ARCHITECTURE FOR TOLERATING PROCESSOR FAILURES IN SHARED-MEMORY MULTIPROCESSORS

Citation
M. Banatre et al., AN ARCHITECTURE FOR TOLERATING PROCESSOR FAILURES IN SHARED-MEMORY MULTIPROCESSORS, I.E.E.E. transactions on computers, 45(10), 1996, pp. 1101-1115
Number of citations
40
Subject Categories
Computer Sciences; Engineering, Electrical & Electronic; Computer Science, Hardware & Architecture
Journal ISSN
00189340
Volume
45
Issue
10
Year of publication
1996
Pages
1101 - 1115
Database
ISI
SICI code
0018-9340(1996)45:10<1101:AAFTPF>2.0.ZU;2-2
Abstract
This paper focuses on the problem of fault tolerance in shared memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to applications software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault tolerant shared memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications.