H. Higaki and T. Soneoka, GROUP-TO-GROUP COMMUNICATIONS FOR FAULT-TOLERANCE IN DISTRIBUTED SYSTEMS, IEICE Transactions on Information and Systems, E76D(11), 1993, pp. 1348-1357
This paper proposes a group-to-group communications algorithm that
extends the range of distributed systems in which active replication
fault-tolerance can be achieved to partner model distributed systems, in
which all processes communicate with each other on an equal footing. The
active replication approach, in which all replicated processes are
active, can achieve fault-tolerance with low overhead because checkpoint
setting and rollback are not required for recovery from process failure.
This algorithm guarantees that each replicated process in a process
group has the same execution history and that communications between
process groups remain consistent even in the presence of process failure
and message loss. The number of control messages that must be
transmitted between processes for a communication between process groups
is only linear in the number of replicated processes in each process
group. Furthermore, this algorithm reduces the overhead of reconfiguring
a process group by keeping process failure and recovery information
local to each process group.
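
The idea behind active replication can be illustrated with a minimal
sketch. This is not the paper's algorithm: the Replica class, the
multicast helper, and the deposit/withdraw messages are hypothetical
names chosen for illustration, and message delivery is assumed to be
reliable and identically ordered at every replica. Under those
assumptions, deterministic replicas that apply the same messages in the
same order end up with the same execution history, so the surviving
replicas can continue after a failure without checkpointing or rollback.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One deterministic replica of a process (hypothetical model)."""
    name: str
    balance: int = 0
    history: list = field(default_factory=list)
    alive: bool = True

    def deliver(self, msg):
        # A failed replica simply stops processing messages.
        if not self.alive:
            return
        op, amount = msg
        if op == "deposit":
            self.balance += amount
        elif op == "withdraw":
            self.balance -= amount
        # Record the message so execution histories can be compared.
        self.history.append(msg)


def multicast(group, msg):
    """Deliver msg to every replica in the group, in the same order.

    Assumption: delivery is reliable and ordered; a real protocol has to
    enforce this with control messages.
    """
    for replica in group:
        replica.deliver(msg)


group = [Replica("p1"), Replica("p2"), Replica("p3")]
multicast(group, ("deposit", 100))

# p1 fails; the other replicas are already active, so nothing is rolled back.
group[0].alive = False
multicast(group, ("withdraw", 30))

survivors = [r for r in group if r.alive]
for r in survivors:
    print(r.name, r.balance, r.history)
```

The surviving replicas p2 and p3 hold identical state and histories,
which is the property the abstract describes for each process group; how
the real algorithm maintains that ordering under message loss is not
shown here.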