This paper investigates the problem of communication errors in distributed
shared memory (DSM) systems. Communication errors can introduce two critical
problems: damage and loss. With damage, the transmitted data is corrupted in
transit, which then produces incorrect computational results. With loss, the
transmitted data is lost in transit and never received. However, the loss
problem can be easily resolved using acknowledgements.
Therefore, we focus on how to efficiently handle the damage problem. In DSM
systems, the size of the data transferred between nodes is larger than the
size of the data actually shared between them. That is, when a processing
node receives data, not all of the data items it receives will be used. Based
on this property, we present a new technique to resolve the data damage problem
in DSM systems. This technique allows a processing node that receives damaged
data to continue computation instead of blocking to wait for the correct
data. Therefore, the latency for handling the data damage can be hidden.
However, the proposed technique relies on an optimistic assumption. If this
assumption does not hold, the latency will not be hidden. To show the
advantage and the overhead of the proposed technique, we
perform extensive trace-driven simulations. The simulation results show that
at least 62% of the latency for handling data damage can be hidden.
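
The trade-off between blocking and continuing can be illustrated with a
small toy model. The sketch below is not the simulator used in this paper,
and it rests on stated assumptions: the optimistic assumption is read here as
"the node does not touch the damaged data before the correct copy arrives",
which the abstract itself does not spell out, and the names
DamagedPageEvent, blocking_stall, optimistic_stall and hidden_fraction, as
well as all the timing values, are hypothetical and chosen only for
illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DamagedPageEvent:
    """One reception of damaged data (times in arbitrary cycles, made up)."""
    retransmit_latency: int      # time until the correct copy arrives
    first_access: Optional[int]  # time until the node first uses the data;
                                 # None means the data is never used


def blocking_stall(ev: DamagedPageEvent) -> int:
    # Conservative handling: stop at once and wait for the correct copy.
    return ev.retransmit_latency


def optimistic_stall(ev: DamagedPageEvent) -> int:
    # Optimistic handling (assumed reading): keep computing, and stall only
    # if the damaged data is used before the correct copy has arrived.
    if ev.first_access is None:
        return 0
    return max(0, ev.retransmit_latency - ev.first_access)


def hidden_fraction(events: List[DamagedPageEvent]) -> float:
    # Share of the blocking latency that the optimistic handling avoids.
    total = sum(blocking_stall(e) for e in events)
    stalled = sum(optimistic_stall(e) for e in events)
    return (total - stalled) / total if total else 0.0


events = [
    DamagedPageEvent(100, None),  # never used: latency fully hidden
    DamagedPageEvent(100, 250),   # used after the correct copy arrives
    DamagedPageEvent(100, 75),    # used early: stalls for 25 cycles
    DamagedPageEvent(100, 25),    # used very early: stalls for 75 cycles
]
print(hidden_fraction(events))  # 0.75 for these made-up values
```

In this model, an event whose first_access is 0 stalls for the full
retransmission latency, which corresponds to the case where the optimistic
assumption does not hold and no latency is hidden. The 0.75 printed above is
a property of the invented inputs only and is unrelated to the 62% figure
obtained from the trace-driven simulations.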