C.-J. Hou and K. G. Shin, "Incorporation of Optimal Timeouts into Distributed Real-Time Load Sharing," IEEE Transactions on Computers, 43(5), 1994, pp. 528-547.
This paper addresses the problem of designing and incorporating a timeout mechanism into load sharing (LS) with state-region change broadcasts in the presence of node failures in a distributed real-time system. Failure of a node is diagnosed by the other nodes through communication timeouts, and the timeout period used to diagnose whether a node is faulty usually depends on the dynamic changes in system load, the task attributes at the node, and the state the node was initially in. We formulate the problem of determining the "best" timeout period T_out(i) for node i as a hypothesis-testing problem: maximize the probability of detecting node failures subject to a pre-specified probability of falsely diagnosing a healthy node as faulty. The parameters needed to calculate T_out(i) are estimated on-line by node i using the Bayesian technique and are piggybacked on its region-change broadcasts; the broadcast information is then used to determine T_out(i). If node n has not heard from node i for T_out(i) since its receipt of the latest broadcast from node i, it considers node i failed and will not consider any task transfer to node i until it receives a broadcast message from node i again. On the other hand, to further reduce the probability of incorrect diagnosis, each node n also determines its own timeout period T_out(n), and broadcasts its state not only at the time of state-region changes but also when it has remained within the same state region throughout T_out(n). Our simulation results show that the LS algorithm which combines the on-line parameter estimation, the timeout mechanism, and a few extra timely broadcasts can significantly reduce the probability of missing task deadlines, as compared to algorithms either without any timeout mechanism or with a fixed timeout period.
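
To make the timeout rule concrete, the following is a minimal sketch in Python, not the paper's algorithm, under two simplifying assumptions: inter-broadcast intervals from a healthy node are exponentially distributed, and the unknown broadcast rate is estimated with a conjugate Gamma prior whose posterior mean is plugged into the test. Under these assumptions a healthy node that stays silent for T is falsely diagnosed with probability exp(-lambda*T), so the timeout meeting a false-diagnosis level alpha is T_out = -ln(alpha)/lambda. The names TimeoutEstimator, observe_interval, and timeout are hypothetical.

    import math

    class TimeoutEstimator:
        """Sketch: pick T_out as the (1 - alpha) quantile of the silence
        distribution under the 'node is healthy' hypothesis, assuming
        exponential inter-broadcast intervals with an unknown rate that
        is estimated on-line via a conjugate Gamma(a, b) prior."""

        def __init__(self, alpha=0.01, prior_shape=1.0, prior_rate=1.0):
            self.alpha = alpha        # allowed probability of falsely
                                      # diagnosing a healthy node as faulty
            self.shape = prior_shape  # Gamma posterior shape a
            self.rate = prior_rate    # Gamma posterior rate b

        def observe_interval(self, x):
            """Bayesian update on one observed inter-broadcast interval x:
            Gamma(a, b) -> Gamma(a + 1, b + x) for exponential data."""
            self.shape += 1.0
            self.rate += x

        def timeout(self):
            """T_out with P(silence > T_out | healthy) = alpha under the
            posterior-mean rate estimate lambda = a / b."""
            lam = self.shape / self.rate
            return -math.log(self.alpha) / lam

For example, with alpha = 0.05 and observed intervals of 0.8, 1.2, and 0.9 seconds, the posterior-mean rate is 4/3.9 and timeout() returns about 2.9 seconds.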
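
The per-node bookkeeping this diagnosis rule requires on node n can likewise be sketched. The class and method names below are hypothetical, and in the paper's scheme a broadcast carries the estimated parameters from which the receiver computes T_out(i); this sketch simply passes a finished timeout value for brevity.

    import time

    class FailureDetector:
        """Sketch of node n's view of its peers: node i is considered
        failed once T_out(i) elapses after its latest broadcast, and it
        is excluded from task transfers until it broadcasts again."""

        def __init__(self):
            self.last_heard = {}  # node id -> time of latest broadcast
            self.t_out = {}       # node id -> current timeout period (s)

        def on_broadcast(self, node, t_out):
            """Record a region-change broadcast; a node previously
            diagnosed as failed becomes eligible for transfers again."""
            self.last_heard[node] = time.monotonic()
            self.t_out[node] = t_out

        def eligible_receivers(self):
            """Nodes not currently diagnosed as failed, i.e. candidates
            for task transfer under the LS policy."""
            now = time.monotonic()
            return [i for i, t in self.last_heard.items()
                    if now - t <= self.t_out[i]]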