Z. Kalbarczyk et al., Hierarchical simulation approach to accurate fault modeling for system dependability evaluation, IEEE Transactions on Software Engineering, 25(5), 1999, pp. 619-632
This paper presents a hierarchical simulation methodology that enables
accurate system evaluation under realistic faults and conditions. In this
methodology, effects of low-level (i.e., transistor or circuit level)
faults are propagated to higher levels (i.e., system level) using fault
dictionaries.
The primary fault models are obtained via simulation of the
transistor-level effect of a radiation particle penetrating a device. The
resulting current bursts constitute the first-level fault dictionary and
are used in the circuit-level simulation to determine the impact on circuit
latches and flip-flops. The latched outputs constitute the next-level fault
dictionary in the hierarchy and are applied in conducting fault injection
simulation at the chip level under selected workloads or application
programs. Faults injected at the chip level result in memory corruptions,
which are used to form the next-level fault dictionary for the system-level
simulation of an application running on simulated hardware. When an
application terminates, either normally or abnormally, the overall fault
impact on the software behavior
is quantified and analyzed. The system in this sense can be a single
workstation or a network. The simulation method is demonstrated and
validated in a case study of a network system based on Myrinet (a
commercial, high-speed network). The study shows that the method: 1) allows
detailed simulation of faults at lower levels and effective fault
propagation through the system to the user-visible higher levels using
fault dictionaries, 2) links physical faults with effects that the user can
observe at the higher levels and thus provides a foundation for realistic
fault injection studies, 3) allows a significant reduction in the number of
simulations needed, due to the fault dictionary method, 4) offers high
confidence in the evaluation results because the system is analyzed in the
presence of realistic fault conditions, and 5) provides valuable feedback
for designing error recovery mechanisms to improve dependability.
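
The chaining of fault dictionaries described in the abstract can be
illustrated with a minimal Python sketch. It is not the authors'
implementation: the four simulator functions, their names, and the labels
they return are hypothetical stand-ins invented for illustration, and the
rule that only a "high energy" strike flips a latch is an assumption made
purely so that the example has something to filter. The sketch only shows
the data flow: each level simulates every distinct incoming fault once,
records the effects in a dictionary, and passes the distinct effects on as
the fault list for the next level.

```python
from typing import Callable, Dict, Iterable, List, Tuple

FaultDictionary = Dict[str, List[str]]
Simulator = Callable[[str], List[str]]

# Hypothetical stand-ins for the per-level simulators. A real study would
# invoke device, circuit, chip and system simulators here; these toy
# functions return made-up labels so the data flow can be followed.
def transistor_sim(strike: str) -> List[str]:
    return [f"burst({strike})"]

def circuit_sim(burst: str) -> List[str]:
    # Illustrative assumption: only bursts from high-energy strikes latch.
    return [f"latch_flip({burst})"] if "high" in burst else []

def chip_sim(latch_error: str) -> List[str]:
    return [f"mem_corrupt({latch_error})"]

def system_sim(corruption: str) -> List[str]:
    return ["abnormal_termination"]

LEVELS: List[Simulator] = [transistor_sim, circuit_sim, chip_sim, system_sim]

def build_dictionary(faults: Iterable[str], simulate: Simulator,
                     runs: Dict[str, int]) -> FaultDictionary:
    """Simulate each distinct fault once and record its effects."""
    dictionary: FaultDictionary = {}
    for fault in faults:
        if fault not in dictionary:
            dictionary[fault] = simulate(fault)
            runs[simulate.__name__] = runs.get(simulate.__name__, 0) + 1
    return dictionary

def propagate(initial_faults: Iterable[str], levels: List[Simulator]
              ) -> Tuple[List[FaultDictionary], List[str], Dict[str, int]]:
    """Chain the levels: the effects at one level are the faults of the next."""
    faults = list(initial_faults)
    dictionaries: List[FaultDictionary] = []
    runs: Dict[str, int] = {}
    for simulate in levels:
        dictionary = build_dictionary(faults, simulate, runs)
        dictionaries.append(dictionary)
        # Flatten the effects and drop duplicates while keeping their order.
        faults = list(dict.fromkeys(
            effect for effects in dictionary.values() for effect in effects))
    return dictionaries, faults, runs

dictionaries, outcomes, runs = propagate(
    ["low_energy", "high_energy", "high_energy"], LEVELS)
```

In this toy run the duplicate high-energy strike is simulated only once at
the transistor level, and the low-energy burst produces no latched error, so
the chip-level and system-level simulators each run a single time. This is
the kind of reduction in simulation count that the abstract attributes to
the fault dictionary method, shown here only on invented data.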