H. B. Kim and K. G. Shin, Design and Analysis of an Optimal Instruction Retry Policy for TMR Controller Computers, IEEE Transactions on Computers, 45(11), 1996, pp. 1217-1225
An instruction-retry policy is proposed to enhance the fault-tolerance
of triple modular redundant (TMR) controller computers by adding time
redundancy to them. A TMR failure is said to occur if a TMR system
fails to establish a majority among its modules' outputs due to
multiple faulty modules or a faulty voter. Either multiple consecutive
TMR failures whose active period exceeds a certain time limit, or the
exhaustion of spares as a result of frequent system reconfigurations,
may result in failure to meet the timing constraints of one or more
tasks, called dynamic failure, during a given mission. An optimal
instruction-retry period is derived by minimizing the probability of
dynamic failure upon detection of either a masked (by the TMR) error
or a TMR failure. Using the derived optimal retry period, we also
derive the minimum number of spares needed to keep the probability of
dynamic failure for a given mission below a pre-specified level.
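
The distinction the abstract draws between a masked error and a TMR failure can be illustrated with a minimal majority-voter sketch. This is not the authors' implementation: the names `VoteResult` and `vote` are hypothetical, the voter itself is assumed fault-free (the abstract also counts a faulty voter as a cause of TMR failure), and the retry-period optimization is not modeled.

```python
from collections import Counter
from enum import Enum


class VoteResult(Enum):
    AGREE = "agree"                # all three modules agree
    MASKED_ERROR = "masked_error"  # one module disagrees; the majority masks it
    TMR_FAILURE = "tmr_failure"    # no majority can be established


def vote(outputs):
    """Majority-vote three module outputs (hypothetical helper).

    Returns (result, value); value is None when no majority exists.
    Assumes outputs are hashable and the voter itself is fault-free.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three module outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return VoteResult.AGREE, value
    if count == 2:
        return VoteResult.MASKED_ERROR, value
    return VoteResult.TMR_FAILURE, None
```

In this sketch, either a `MASKED_ERROR` or a `TMR_FAILURE` result would be the point at which a retry policy such as the one described above is triggered. Two modules that fail with the same wrong value would be reported as a majority here, which is a limitation of the illustration.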