OPTIMAL TRACING AND REPLAY FOR DEBUGGING MESSAGE-PASSING PARALLEL PROGRAMS

Citation
Rhb. Netzer et Bp. Miller, OPTIMAL TRACING AND REPLAY FOR DEBUGGING MESSAGE-PASSING PARALLEL PROGRAMS, Journal of supercomputing, 8(4), 1995, pp. 371-388
Citations number
12
Categorie Soggetti
Computer Sciences","Engineering, Eletrical & Electronic","Computer Science Hardware & Architecture","Computer Science Theory & Methods
Journal title
ISSN journal
09208542
Volume
8
Issue
4
Year of publication
1995
Pages
371 - 388
Database
ISI
SICI code
0920-8542(1995)8:4<371:OTARFD>2.0.ZU;2-F
Abstract
A common debugging strategy involves reexecuting a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of nondeterminacy, different runs on the given input may produce diff erent results. This nonrepeatability is a serious debugging problem, s ince an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs. By tracing the order in which messages are delivered, a reex ecution can be forced to deliver messages in their original order, rep roducing the original execution. To reduce the overhead of such a sche me, we show that the delivery order of only messages involved in races need be traced (and not every message). Our technique makes run-time decisions to detect and trace racing messages and is usually optimal i n the sense that the minimal number of racing messages is traced. Expe riments indicate that only 1% of the messages are often traced, gainin g a reduction of two orders of magnitude over traditional techniques t hat trace every message. These traces allow an execution to be reprodu ced any number of times for debugging. Our work is novel in that we ad aptively decide what to trace, and trace only those messages that intr oduce nondeterminacy. With our strategy, large reductions in trace siz e allow long-running programs to be replayed that were previously unma nageable. In addition, the reduced tracing requirements alleviate trac ing bottle-necks, allowing executions to be debugged with substantiall y lower execution time overhead.