We propose a novel architecture for a fault-tolerant multiprocessor
environment. The multiprocessor organization is assumed to consist of
a pool of active processing modules and either a small number of spare
modules or active modules with some spare processing capacity. A
fault-tolerance scheme is developed for duplex systems using
checkpoints. Unlike traditional checkpointing schemes, our scheme
requires no rollbacks to recover from single faults. The objective here
is to achieve the performance of a triple modular redundant (TMR)
system using only duplex redundancy. In the proposed scheme, the states
of the two modules executing the task are compared at each checkpoint
to detect faults. If they disagree, indicating a fault, both differing
states are stored. Instead of the usual rollback and retry, the
following mechanism is used. The state at the preceding
checkpoint, where both processing modules had agreed, is loaded into a
spare module. The checkpoint interval in which the failure was detected
is then "retried" on the spare module. Concurrently, the task continues
forward on the two active modules, beyond the checkpoint where the
disagreement occurred. At the next checkpoint, the state of the spare
is compared with the stored states of the two active modules (the
states saved at the checkpoint where the disagreement occurred). The
active module whose stored state disagrees with the spare is identified
as faulty. Its state is then restored by copying the state of the other
active module, which is fault-free. The spare is released to the pool
once recovery is complete. Note that the spare is shared among many
processor pairs and is used only temporarily, when faults occur.
Because this mechanism achieves forward recovery, the proposed scheme
is termed the Roll-Forward Checkpointing Scheme (RFCS). RFCS allows
recovery from single failures without the overhead of rollback. Its
advantage is a lower average execution time, with lower variance, than
the rollback scheme. This can be crucial for real-time systems, since
lower variance makes the task completion time more predictable.
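
The recovery step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `roll_forward`, the use of integers as module states, and the `step` callback standing in for one checkpoint interval are all hypothetical. The sketch assumes a single transient fault, a fault-free spare, and a fault-free interval following the disagreement. The error raised when the spare matches neither stored state is also an assumption, since the text covers only single faults.

```python
from typing import Callable, Tuple

State = int                      # hypothetical stand-in for a module's state
Step = Callable[[State], State]  # hypothetical: runs one checkpoint interval


def roll_forward(last_good: State, stored_a: State, stored_b: State,
                 step: Step) -> Tuple[str, State]:
    """Resolve a disagreement between active modules A and B.

    last_good: state at the preceding checkpoint, where A and B agreed.
    stored_a, stored_b: the differing states saved at the disagreement.
    Returns the faulty module's name and the state copied into it.
    """
    # The spare retries the interval in which the disagreement arose.
    spare = step(last_good)

    # Concurrently (modelled sequentially here), both active modules
    # continue forward from their own, differing states.
    next_a = step(stored_a)
    next_b = step(stored_b)

    # At the next checkpoint, compare the spare with the stored states.
    if spare == stored_a:
        return "B", next_a   # B was faulty; copy A's current state into B
    if spare == stored_b:
        return "A", next_b   # A was faulty; copy B's current state into A
    # Assumption: anything beyond a single fault is outside this sketch.
    raise RuntimeError("spare matches neither stored state")


# Example: each interval increments the state; B's result was corrupted.
faulty, restored = roll_forward(10, 11, 99, lambda s: s + 1)
print(faulty, restored)  # prints: B 12
```

In this sketch the fault-free module loses no work: its state after the extra interval is the one copied into the faulty module, so the pair resumes from the later checkpoint rather than from the earlier one.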