Rhb. Netzer et Bp. Miller, OPTIMAL TRACING AND REPLAY FOR DEBUGGING MESSAGE-PASSING PARALLEL PROGRAMS, Journal of supercomputing, 8(4), 1995, pp. 371-388
A common debugging strategy involves reexecuting a program (on a given
input) over and over, each time gaining more information about bugs.
Such techniques can fail on message-passing parallel programs. Because
of nondeterminacy, different runs on the given input may produce diff
erent results. This nonrepeatability is a serious debugging problem, s
ince an execution cannot always be reproduced to track down bugs. This
paper presents a technique for tracing and replaying message-passing
programs. By tracing the order in which messages are delivered, a reex
ecution can be forced to deliver messages in their original order, rep
roducing the original execution. To reduce the overhead of such a sche
me, we show that the delivery order of only messages involved in races
need be traced (and not every message). Our technique makes run-time
decisions to detect and trace racing messages and is usually optimal i
n the sense that the minimal number of racing messages is traced. Expe
riments indicate that only 1% of the messages are often traced, gainin
g a reduction of two orders of magnitude over traditional techniques t
hat trace every message. These traces allow an execution to be reprodu
ced any number of times for debugging. Our work is novel in that we ad
aptively decide what to trace, and trace only those messages that intr
oduce nondeterminacy. With our strategy, large reductions in trace siz
e allow long-running programs to be replayed that were previously unma
nageable. In addition, the reduced tracing requirements alleviate trac
ing bottle-necks, allowing executions to be debugged with substantiall
y lower execution time overhead.