Enhancing replica management services to cope with group failures

Citation
P. D. Ezhilchelvan and S. K. Shrivastava, Enhancing replica management services to cope with group failures, LECT N COMP, 1752, 2000, pp. 79-103
Number of citations
22
Subject categories
Current Book Contents
Journal ISSN
0302-9743
Volume
1752
Year of publication
2000
Pages
79 - 103
Database
ISI
SICI code
0302-9743(2000)1752:<79:ERMSTC>2.0.ZU;2-7
Abstract
In a distributed system, replication of components, such as objects, is a well known way of achieving availability. For increased availability, crashed and disconnected components must be replaced by new components on available spare nodes. This replacement results in the membership of the replicated group 'walking' over a number of machines during system operation. In this context, we address the problem of reconfiguring a group after the group as an entity has failed. Such a failure is termed a group failure which, for example, can be the crash of every component in the group or the group being partitioned into minority islands. The solution assumes crash-proof storage, and eventual recovery of crashed nodes and healing of partitions. It guarantees that (i) the number of groups reconfigured after a group failure is never more than one, and (ii) the reconfigured group contains a majority of the components which were members of the group just before the group failure occurred, so that the loss of state information due to a group failure is minimal. Though the protocol is subject to blocking, it remains efficient in terms of communication rounds and use of stable store, during both normal operations and reconfiguration after a group failure.
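
Guarantees (i) and (ii) are connected by a standard counting argument: any two strict majorities of the same membership set must share a member, so at most one set of recovered nodes can qualify to reconfigure the group. The Python sketch below illustrates only that majority test. It is not the protocol from the paper; the function name `is_majority`, the node names, and the assumption that each node can read the last membership view from its crash-proof storage are all hypothetical.

```python
from typing import FrozenSet, Iterable


def is_majority(candidates: Iterable[str], last_view: FrozenSet[str]) -> bool:
    """Return True if `candidates` include a strict majority of `last_view`.

    `last_view` stands for the group membership recorded just before the
    group failure (assumed here to be recoverable from stable storage).
    Nodes outside `last_view`, such as fresh spares, are not counted.
    """
    survivors = set(candidates) & last_view
    return 2 * len(survivors) > len(last_view)


# Hypothetical five-member group that has split into two islands.
last_view = frozenset({"n1", "n2", "n3", "n4", "n5"})
island_a = {"n1", "n2", "n3", "spare1"}  # three old members plus a spare
island_b = {"n4", "n5"}                  # two old members only

print(is_majority(island_a, last_view))  # True: 3 of 5 old members
print(is_majority(island_b, last_view))  # False: 2 of 5 old members
```

In this example only `island_a` could go on to reconfigure. If no island holds a majority, nothing can proceed until more nodes recover or partitions heal, which is consistent with the abstract's remark that the protocol is subject to blocking.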