M. Banatre et al., AN ARCHITECTURE FOR TOLERATING PROCESSOR FAILURES IN SHARED-MEMORY MULTIPROCESSORS, I.E.E.E. transactions on computers, 45(10), 1996, pp. 1101-1115
This paper focuses on the problem of fault tolerance in shared-memory
multiprocessors, and describes an architecture designed for
transparently tolerating processor failures. The Recoverable Shared
Memory (RSM) is the novel component of this architecture, providing a
hardware-supported backward error recovery mechanism which minimizes
the propagation of recovery when a processor fails. The RSM permits a
shared-memory multiprocessor to be constructed using standard caches
and cache coherence protocols, and does not require any changes to be
made to applications software. The performance of the recovery scheme
supported by the RSM is evaluated and compared with other schemes that
have been proposed for fault-tolerant shared-memory multiprocessors.
The performance study has been conducted by simulation using address
traces collected from real parallel applications.