ROLL-FORWARD CHECKPOINTING SCHEME - A NOVEL FAULT-TOLERANT ARCHITECTURE

Citation
D.K. Pradhan and N.H. Vaidya, ROLL-FORWARD CHECKPOINTING SCHEME - A NOVEL FAULT-TOLERANT ARCHITECTURE, IEEE Transactions on Computers, 43(10), 1994, pp. 1163-1174
Number of citations
13
Subject Categories
Computer Sciences; Engineering, Electrical & Electronic; Computer Science Hardware & Architecture
Journal ISSN
0018-9340
Volume
43
Issue
10
Year of publication
1994
Pages
1163 - 1174
Database
ISI
SICI code
0018-9340(1994)43:10<1163:RCS-AN>2.0.ZU;2-Z
Abstract
Proposed here is a novel architecture for a fault-tolerant multiprocessor environment. It is assumed that the multiprocessor organization consists of a pool of active processing modules and either a small number of spare modules or active modules with some spare processing capacity. A fault-tolerance scheme is developed for duplex systems using checkpoints. Our scheme, unlike traditional checkpointing schemes, requires no rollbacks for recovering from single faults. The objective here is to achieve the performance of a Triple Modular Redundant system using duplex system redundancy. In the proposed scheme, at each checkpoint, the states of the two modules executing the task are compared to detect faults. If a disagreement occurs, indicating a fault, the two differing states are both stored. Instead of performing the usual rollback and retry, the following mechanism is used. The state at the preceding checkpoint, where both processing modules had agreed, is loaded into a spare module. The checkpoint interval in which the failure is detected is then "retried" on the spare module. Concurrently, the task continues forward on the two active modules, beyond the checkpoint where the disagreement occurred. At the next checkpoint, the state of the spare is compared with the stored states of the two active modules (the stored states correspond to the checkpoint where the disagreement occurred). The active module that disagrees with the spare is identified as faulty. Once the faulty module is identified, its state is restored to the correct state by copying the state from the other active module, which is fault-free. The spare is released to the pool after recovery is completed. It is important to note that the spare is shared among many processor pairs and is used only temporarily when faults occur. Since the above mechanism achieves forward recovery, the proposed scheme is termed the Roll-Forward Checkpointing Scheme (RFCS).
RFCS allows recovery from single failures without the overhead of rollback. The advantage of the proposed scheme is that it achieves a lower average execution time with a lower variance than the rollback scheme. This can be crucial for real-time systems, since lower variance enhances the predictability of the task completion time.
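
The recovery steps described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the module state is modeled as an integer, `step` is a hypothetical deterministic workload for one checkpoint interval, a fault is modeled as a single bit flip in one active module during one interval, and the spare (module 2) is assumed fault-free, in line with the single-fault assumption. The names `step`, `reference`, and `run_rfcs` are invented for illustration.

```python
from typing import List, Optional, Tuple


def step(state: int, k: int) -> int:
    """Hypothetical deterministic workload for checkpoint interval k."""
    return state * 31 + k + 1


def reference(n_intervals: int, init: int = 0) -> int:
    """Fault-free result, used only to check the simulation."""
    state = init
    for k in range(n_intervals):
        state = step(state, k)
    return state


def run_rfcs(
    n_intervals: int,
    fault: Optional[Tuple[int, int]] = None,
    init: int = 0,
) -> Tuple[int, List[Tuple[int, int]]]:
    """Simulate a duplex pair with roll-forward recovery.

    fault = (module, interval): active module 0 or 1 produces a corrupted
    state in that interval (assumption: a single transient fault).
    Returns the final state and a log of (interval, faulty_module) pairs.
    """

    def execute(module: int, state: int, k: int) -> int:
        out = step(state, k)
        if fault == (module, k):
            out ^= 1  # inject the fault as a single bit flip
        return out

    a = b = agreed = init  # 'agreed' is the last checkpoint where A and B matched
    log: List[Tuple[int, int]] = []
    k = 0
    while k < n_intervals:
        a, b = execute(0, a, k), execute(1, b, k)
        if a == b:
            agreed = a  # checkpoint comparison succeeded
            k += 1
            continue

        # Disagreement: store both differing states.
        stored_a, stored_b = a, b
        # The spare retries interval k from the last agreed checkpoint ...
        spare = execute(2, agreed, k)
        # ... while both active modules continue forward (sequential here,
        # concurrent in the scheme described above).
        advanced = 1
        if k + 1 < n_intervals:
            a, b = execute(0, a, k + 1), execute(1, b, k + 1)
            advanced = 2

        # The stored state that disagrees with the spare marks the faulty module.
        faulty = 0 if stored_a != spare else 1
        # Restore the faulty module by copying the fault-free module's state.
        if faulty == 0:
            a = b
        else:
            b = a
        agreed = a
        log.append((k, faulty))
        k += advanced  # no interval is re-executed by the active pair

    return a, log


final_state, recoveries = run_rfcs(5, fault=(1, 2))
```

In this sketch the fault-free module never repeats work: after a disagreement the pair has already completed the following interval by the time the faulty module is identified, which is the sense in which recovery is "forward" rather than a rollback.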