Efficient recovery from communication errors in distributed shared memory systems

Authors

Lin, JW; Kuo, SY

Citation

J.W. Lin and S.Y. Kuo, Efficient recovery from communication errors in distributed shared memory systems, IEICE T INF, E81D(11), 1998, pp. 1213-1223

Citations number

Subject Categories

Information Technology & Communication Systems

Journal title

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS

ISSN journal

09168532

Volume

E81D

Issue

11

Year of publication

1998

Pages

1213 - 1223

Database

ISI

SICI code

0916-8532(199811)E81D:11<1213:ERFCEI>2.0.ZU;2-O

Abstract

This paper investigates the problem of communication errors in distributed shared memory (DSM) systems. Communication errors can introduce two critical problems: damage and loss. Damage corrupts the transmitted data and thus produces incorrect computational results. Loss causes the transmitted data to disappear in transit, so that it is never received. The loss problem, however, can be easily resolved using acknowledgements. Therefore, we focus on how to efficiently handle the damage problem. In DSM systems, the amount of data transferred between nodes is larger than the amount actually shared between them. That is, when a processing node receives data, not all of the data items it receives will be used. Based on this property, we present a new technique to resolve the data damage problem in DSM systems. When a processing node receives damaged data, this technique allows it to continue computation instead of blocking to wait for the correct data. Therefore, the latency of handling data damage can be hidden. However, the proposed technique relies on an optimistic assumption; if this assumption does not hold, the latency will not be hidden. To show the advantages and the overhead of the proposed technique, we perform extensive trace-driven simulations. The simulation results show that at least 62% of the latency for handling data damage can be hidden.
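The abstract does not describe the mechanism in detail, so the following is only a minimal Python sketch of the general idea, not the authors' protocol. It assumes that each transferred block carries a per-item CRC32 checksum (computed with `zlib.crc32`) and that a hypothetical blocking callback `fetch_item` stands in for retransmission of a correct copy; the class `OptimisticReceiver` and its methods are illustrative names. It also assumes, as one plausible reading of the optimistic assumption, that a damaged item is usually not needed before its correct copy arrives: the node keeps computing after receiving damaged data and waits only when it actually reads an item that is still damaged.

```python
import zlib
from typing import Callable, List, Set


class OptimisticReceiver:
    """Hypothetical sketch: keep computing on a partly damaged block and
    wait only when a damaged item is actually read."""

    def __init__(self, fetch_item: Callable[[int], bytes]) -> None:
        # Assumed blocking call returning a correct copy of one item;
        # it stands in for a retransmission request to the sender.
        self.fetch_item = fetch_item
        self.items: List[bytes] = []
        self.damaged: Set[int] = set()
        self.stalls = 0  # reads that had to wait (latency not hidden)

    def receive(self, items: List[bytes], checksums: List[int]) -> Set[int]:
        """Accept a block and record items whose CRC32 does not match.

        Computation is not blocked here, even if some items are damaged.
        """
        self.items = list(items)
        self.damaged = {
            i
            for i, (data, crc) in enumerate(zip(items, checksums))
            if zlib.crc32(data) != crc
        }
        return set(self.damaged)

    def repair(self, index: int, data: bytes) -> None:
        """Install a correct copy that arrived in the background."""
        self.items[index] = data
        self.damaged.discard(index)

    def read(self, index: int) -> bytes:
        """Return an item, waiting only if it is still damaged."""
        if index in self.damaged:
            # Optimistic assumption failed: the item is needed before repair.
            self.stalls += 1
            self.repair(index, self.fetch_item(index))
        return self.items[index]


if __name__ == "__main__":
    good = [b"x=1", b"y=2", b"z=3"]
    sums = [zlib.crc32(d) for d in good]   # sender-side checksums
    corrupted = [b"x=1", b"y=9", b"z=3"]   # item 1 damaged in transit

    node = OptimisticReceiver(fetch_item=lambda i: good[i])
    print(node.receive(corrupted, sums))   # {1}
    print(node.read(0), node.stalls)       # b'x=1' 0  (intact item, no wait)
    node.repair(1, good[1])                # correct copy arrives before use
    print(node.read(1), node.stalls)       # b'y=2' 0  (latency hidden)

    node.receive(corrupted, sums)
    print(node.read(1), node.stalls)       # b'y=2' 1  (needed too early: stall)
```

Only `read()` ever waits, so in this sketch the cost of damage is paid per item actually used rather than per received block, and the `stalls` counter makes failures of the optimistic assumption visible. A real DSM system would also have to integrate this with its coherence protocol and fault handling, which the sketch ignores.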