OPTIMAL TRACING AND REPLAY FOR DEBUGGING MESSAGE-PASSING PARALLEL PROGRAMS

Authors

NETZER RHB; MILLER BP

Citation

R.H.B. Netzer and B.P. Miller, OPTIMAL TRACING AND REPLAY FOR DEBUGGING MESSAGE-PASSING PARALLEL PROGRAMS, Journal of Supercomputing, 8(4), 1995, pp. 371-388

Citations number

Subject Categories

Computer Sciences; Engineering, Electrical & Electronic; Computer Science Hardware & Architecture; Computer Science Theory & Methods

Journal title

Journal of Supercomputing

ISSN journal

0920-8542

Volume

8

Issue

4

Year of publication

1995

Pages

371 - 388

Database

ISI

SICI code

0920-8542(1995)8:4<371:OTARFD>2.0.ZU;2-F

Abstract

A common debugging strategy involves reexecuting a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of nondeterminacy, different runs on the given input may produce different results. This nonrepeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs. By tracing the order in which messages are delivered, a reexecution can be forced to deliver messages in their original order, reproducing the original execution. To reduce the overhead of such a scheme, we show that the delivery order of only messages involved in races need be traced (and not every message). Our technique makes run-time decisions to detect and trace racing messages and is usually optimal in the sense that the minimal number of racing messages is traced. Experiments indicate that only 1% of the messages are often traced, gaining a reduction of two orders of magnitude over traditional techniques that trace every message. These traces allow an execution to be reproduced any number of times for debugging. Our work is novel in that we adaptively decide what to trace, and trace only those messages that introduce nondeterminacy. With our strategy, large reductions in trace size allow long-running programs to be replayed that were previously unmanageable. In addition, the reduced tracing requirements alleviate tracing bottlenecks, allowing executions to be debugged with substantially lower execution time overhead.
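
Illustrative sketch (not from the paper): the basic record-and-replay idea described in the abstract can be shown with a small Python simulation. It assumes a single receiver fed by per-sender FIFO channels, and it records the sender of every delivery, which corresponds to the traditional trace-every-message baseline; the race detection that lets the authors trace only racing messages is not reproduced here. The function `run`, its parameters, and the process names are hypothetical.

```python
import random
from collections import deque


def run(channels, rng=None, trace=None):
    """Deliver all messages from per-sender FIFO channels to one receiver.

    channels: dict mapping sender name -> list of messages (FIFO per sender).
    rng:      random source used in record mode; it stands in for the
              timing nondeterminism of a real network (an assumption).
    trace:    list of sender names; when given, delivery follows this
              recorded order instead of a random one (replay mode).
    Returns (delivered_messages, delivery_order).
    """
    queues = {sender: deque(msgs) for sender, msgs in channels.items()}
    delivered, order = [], []
    replay = iter(trace) if trace is not None else None
    while any(queues.values()):
        if replay is not None:
            # Replay: force the delivery recorded in the original run.
            sender = next(replay)
        else:
            # Record: any sender with a pending message may be delivered next.
            ready = sorted(s for s, q in queues.items() if q)
            sender = rng.choice(ready)
        delivered.append(queues[sender].popleft())
        order.append(sender)
    return delivered, order


# Record one nondeterministic run, then reproduce it from its trace.
channels = {"P1": ["a1", "a2"], "P2": ["b1"], "P3": ["c1", "c2"]}
original, trace = run(channels, rng=random.Random())
replayed, _ = run(channels, trace=trace)
assert replayed == original
```

Only the sender of each delivery is stored, not the message contents, because per-sender FIFO order makes the sender name enough to identify which message to deliver next.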