Finding errors in non-deterministic programs is complicated by the fac
t that a bug may be revealed only by a particular sequence of program
activities. An erroneous program may run correctly hundreds or thousan
ds of times, each time avoiding the failure-causing sequence. This pro
blem is exacerbated in distributed systems since race conditions on me
ssages may not be under the direct control of the programmer. We descr
ibe how message delivery ordering can be controlled during execution.
Our objective is to provide a practical yet powerful testing environme
nt for distributed systems, using re-execution. Previous work in this
area was limited to replaying deterministically the same execution rep
eatedly. We focus on re-executing the program, under a strictly differ
ent message ordering. In this way, latent bugs are more likely to reve
al themselves during testing. We show that messages are grouped into w
aves, such that any two messages from different waves must always be r
eceived in the same order. We provide an algorithm that produces a re-
execution that maximizes the number of reordered pairs of message deli
very events. We prove a tight lower bound of k - 1 reordered pairs of
messages where k is the number of messages in a wave. We also provide
an efficient on-line algorithm for detecting racing messages. Previous
methods for detecting race conditions were either off-line, or limite
d to detecting the races far a single process.