Execution replay and debugging of distributed multi-threaded parallel programs

Citation
Jc. De Kergommeaux et al., Execution replay and debugging of distributed multi-threaded parallel programs, COMPUT A IN, 19(6), 2000, pp. 511-526
Number of citations
28
Subject Categories
Computer Science & Engineering
Journal title
COMPUTERS AND ARTIFICIAL INTELLIGENCE
ISSN journal
0232-0274
Volume
19
Issue
6
Year of publication
2000
Pages
511 - 526
Database
ISI
SICI code
0232-0274(2000)19:6<511:ERADOD>2.0.ZU;2-9
Abstract
Clusters of shared-memory symmetric multiprocessors are increasingly used for high performance computing. To exploit in a convenient way both the inner parallelism of nodes and the parallelism between nodes, programming models for communicating threads are being developed. However, most of these models result in programs exhibiting non-deterministic behavior. This makes cyclic debugging of programs impossible, unless an efficient execution replay system can be provided. This article describes such an execution replay system for distributed thread programming combining synchronization primitives for threads sharing the same node, with communication primitives for threads of different nodes. The execution replay system combines the most efficient trace size reduction technique for shared memory, based on the use of logical clocks, with a very efficient compression technique for trace data that originates from the test functions used in non-blocking communications.
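
Illustrative sketch (not taken from the article): the abstract does not say how the trace data from non-blocking test functions is compressed. One plausible reading, assumed here, is that a repeated test call returns "not yet complete" many times before it succeeds, so the trace only needs the number of failed tests preceding each success. The function names `record_test_outcomes` and `replay_test_outcomes` are hypothetical, and the sketch assumes every recorded request eventually completes.

```python
from typing import Iterable, Iterator, List


def record_test_outcomes(outcomes: Iterable[bool]) -> List[int]:
    """Compress a stream of test results (True = completed) into
    the count of failed tests that precede each success."""
    trace: List[int] = []
    failures = 0
    for completed in outcomes:
        if completed:
            trace.append(failures)
            failures = 0
        else:
            failures += 1
    # Assumption: the stream ends with a success, so no failures are left over.
    return trace


def replay_test_outcomes(trace: Iterable[int]) -> Iterator[bool]:
    """Regenerate the original sequence of test results from the counts."""
    for failures in trace:
        for _ in range(failures):
            yield False  # force the test to report "not yet complete"
        yield True       # then let it report completion


observed = [False, False, True, True, False, True]
trace = record_test_outcomes(observed)
replayed = list(replay_test_outcomes(trace))
```

During replay, a wrapper around the test function would return the regenerated values instead of the real outcome, which would make the re-execution take the same branches as the recorded run. How the article combines this with logical clocks for shared-memory synchronization is not described in the abstract and is not modeled here.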