ROLL-FORWARD CHECKPOINTING SCHEME - A NOVEL FAULT-TOLERANT ARCHITECTURE

Authors

Pradhan DK; Vaidya NH

Citation

D.K. Pradhan and N.H. Vaidya, ROLL-FORWARD CHECKPOINTING SCHEME - A NOVEL FAULT-TOLERANT ARCHITECTURE, IEEE Transactions on Computers, 43(10), 1994, pp. 1163-1174

Subject categories

Computer Sciences; Engineering, Electrical & Electronic; Computer Science, Hardware & Architecture

Journal title

IEEE Transactions on Computers

ISSN journal

0018-9340

Volume

43

Issue

10

Year of publication

1994

Pages

1163 - 1174

Database

ISI

SICI code

0018-9340(1994)43:10<1163:RCS-AN>2.0.ZU;2-Z

Abstract

Proposed here is a novel architecture for a fault-tolerant multiprocessor environment. It is assumed that the multiprocessor organization consists of a pool of active processing modules and either a small number of spare modules or active modules with some spare processing capacity. A fault-tolerance scheme is developed for duplex systems using checkpoints. Our scheme, unlike traditional checkpointing schemes, requires no rollbacks for recovering from single faults. The objective here is to achieve performance of a Triple Modular Redundant system using duplex system redundancy. In the proposed scheme, at each checkpoint, the state of the two modules executing the task is compared for detection of faults. If a disagreement occurs, indicating a fault, the two differing states are both stored. Instead of performing usual rollback and retry, the following mechanism is used. The state at the preceding checkpoint, where both processing modules had agreed, is loaded into a spare module. The checkpoint interval in which the failure is detected is then "retried" on the spare module. Concurrently, the task continues forward on the two active modules, beyond the checkpoint where the disagreement occurred. At the next checkpoint, the state of the spare is compared with the stored states of the two active modules (stored states correspond to where the disagreement occurred). The active module which disagrees with the spare is identified to be faulty. Once the faulty module is identified, the state of the faulty module is restored to the correct state by copying the state from the other active module, which is fault-free. The spare is released to the pool after recovery is completed. It is important to note that the spare is shared among many processor pairs and is used temporarily when faults occur. Since the above mechanism achieves forward recovery, the proposed scheme is termed Roll-Forward Checkpointing Scheme (RFCS).
The RFCS scheme allows recovery from single failures without the overhead of rollback. The advantage of the proposed scheme is that it achieves a lower average execution time with a lower variance as compared to the rollback scheme. This can be crucial for real-time systems since lower variance enhances the predictability of the task completion time.
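
The recovery mechanism described in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Module`, `fault_free_step`, and `run_rfcs` are hypothetical, module state is modeled as a single integer, the task is a made-up deterministic function, a fault is injected by flipping one bit, and the concurrent execution of the spare and the active pair is simulated sequentially. Only the single-fault case covered by the abstract is handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def fault_free_step(state: int, interval: int) -> int:
    """Hypothetical deterministic work done in one checkpoint interval."""
    return state * 31 + interval


@dataclass
class Module:
    name: str
    state: int = 0
    # Interval in which a single transient fault corrupts this module (assumption).
    faulty_interval: Optional[int] = None

    def run_interval(self, interval: int) -> None:
        self.state = fault_free_step(self.state, interval)
        if self.faulty_interval == interval:
            self.state ^= 1  # inject the fault by flipping one bit


def run_rfcs(a: Module, b: Module, spare: Module,
             n_intervals: int) -> Tuple[int, List[Tuple[int, str]]]:
    """Run a duplex task with roll-forward recovery.

    Returns the final state and a list of (interval, faulty module name).
    """
    events: List[Tuple[int, str]] = []
    last_agreed = a.state  # state at the last checkpoint where A and B agreed
    k = 0
    while k < n_intervals:
        a.run_interval(k)
        b.run_interval(k)
        if a.state == b.state:
            last_agreed = a.state
            k += 1
            continue

        # Disagreement at checkpoint k: store both differing states.
        stored_a, stored_b = a.state, b.state
        failed_interval = k

        # The spare retries the failed interval from the last agreed state.
        spare.state = last_agreed
        spare.run_interval(failed_interval)

        # Meanwhile the active pair rolls forward (simulated sequentially here).
        if k + 1 < n_intervals:
            a.run_interval(k + 1)
            b.run_interval(k + 1)
            k += 1

        # The active module whose stored state disagrees with the spare is faulty.
        if spare.state == stored_b:
            faulty, good = a, b
        elif spare.state == stored_a:
            faulty, good = b, a
        else:
            raise RuntimeError("more than a single fault; outside this sketch")

        # Restore the faulty module by copying the fault-free module's state.
        faulty.state = good.state
        last_agreed = good.state
        events.append((failed_interval, faulty.name))
        k += 1
    return a.state, events


if __name__ == "__main__":
    a = Module("A", faulty_interval=2)
    b = Module("B")
    final_state, events = run_rfcs(a, b, Module("spare"), n_intervals=5)
    print(events)  # [(2, 'A')]
```

In this sketch the task never re-executes an interval on the active pair: the fault in interval 2 is diagnosed while interval 3 is already running, which is the roll-forward behavior the abstract contrasts with rollback and retry.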