Efficient recovery from communication errors in distributed shared memory systems

Authors
Jw. Lin, Sy. Kuo
Citation
Jw. Lin and Sy. Kuo, Efficient recovery from communication errors in distributed shared memory systems, IEICE T INF, E81D(11), 1998, pp. 1213-1223
Number of citations
15
Subject Categories
Information Technology & Communication Systems
Journal title
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS
ISSN journal
0916-8532
Volume
E81D
Issue
11
Year of publication
1998
Pages
1213 - 1223
Database
ISI
SICI code
0916-8532(199811)E81D:11<1213:ERFCEI>2.0.ZU;2-O
Abstract
This paper investigates the problem of communication errors in distributed shared memory (DSM) systems. Communication errors can introduce two critical problems: damage and loss. With damage, the transmitted data is corrupted and leads to incorrect computational results. With loss, the transmitted data is lost during transmission and never received. The loss problem, however, can be easily resolved using acknowledgements, so we focus on how to handle the damage problem efficiently. In DSM systems, the size of the data transferred between nodes is larger than the size of the data actually shared between nodes. That is, when a processing node receives data, not all of the data items it contains will be used. Based on this property, we present a new technique to resolve the data damage problem in DSM systems. This technique allows a processing node that receives damaged data to continue computation instead of blocking to wait for the correct data. Therefore, the latency for handling the data damage can be hidden. However, the proposed technique relies on an optimistic assumption; if this assumption does not hold, the latency will not be hidden. To show the advantages and the overhead of the proposed technique, we perform extensive trace-driven simulations. The simulation results show that at least 62% of the latency for handling data damage can be hidden.
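
The idea of continuing past damaged data can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not spell out the optimistic assumption, so the sketch assumes it means that the node does not read any of the damaged items before the correct copy arrives. The class `OptimisticPage`, its methods, and the CRC-based damage check are all hypothetical names and choices made for illustration.

```python
import zlib


def checksum(items):
    # Items are assumed to be small integers (0..255) so they pack into bytes.
    return zlib.crc32(bytes(items))


class OptimisticPage:
    """Hypothetical received DSM page that may have been damaged in transit."""

    def __init__(self, items, expected_crc):
        self.items = list(items)
        # A checksum mismatch marks the page as suspect, but does not block use.
        self.suspect = checksum(self.items) != expected_crc
        self.read_log = set()  # indices read while the page was suspect

    def read(self, index):
        # Computation continues; reads on a suspect page are only recorded.
        if self.suspect:
            self.read_log.add(index)
        return self.items[index]

    def reconcile(self, correct_items):
        """Install the retransmitted copy.

        Returns True if no damaged item was read in the meantime (the
        assumed optimistic case, latency hidden), and False if the caller
        would have to redo the affected computation.
        """
        damaged = {
            i for i, (old, new) in enumerate(zip(self.items, correct_items))
            if old != new
        }
        hidden = not (damaged & self.read_log)
        self.items = list(correct_items)
        self.suspect = False
        self.read_log.clear()
        return hidden
```

In this sketch, a node that reads only undamaged items before `reconcile` is called loses no time to the retransmission, while a node that read a damaged item gets `False` back and must fall back to some recovery step, which the sketch leaves out.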