Clusters of shared-memory symmetric multiprocessors are increasingly used f
or high performance computing. To exploit in a convenient way both the inne
r parallelism of nodes and the parallelism between nodes, programming model
s for communicating threads are being developed. However, most of these mod
els result ill programs exhibiting non-deterministic behavior. This makes c
yclic debugging of programs impossible, unless an efficient execution repla
y system can be provided. This article describes such an execution replay s
ystem for distributed thread programming combining synchronization primitiv
es for threads sharing the same node, with communication primitives for thr
eads of different nodes. The execution replay system combines the most effi
cient trace size reduction technique for shared memory, based on the use of
logical clocks, with a very efficient compression technique for trace data
that originates from the test functions used in non-blocking communication
s.