A. Rogers and K. Pingali, COMPILING FOR DISTRIBUTED-MEMORY ARCHITECTURES, IEEE Transactions on Parallel and Distributed Systems, 5(3), 1994, pp. 281-298
Number of citations
35
Subject Categories
System Science; Engineering, Electrical & Electronic; Computer Science Theory & Methods
Parallel computers provide a large degree of computational power for
programmers who are willing and able to harness it. The introduction of
high-level languages and good compilers made possible the wide use of
sequential machines, but the lack of such tools for parallel machines
hinders their widespread acceptance and use. Programmers must address
issues such as process decomposition, synchronization, and load
balancing. This is a severe burden and opens the door to time-dependent
bugs, such as race conditions between reads and writes, which are
extremely difficult to detect. We have developed a parallelizing compiler
that, given a sequential program and a memory layout of its data,
performs process decomposition while balancing parallelism against
locality of reference. A process decomposition is obtained by
specializing the program for each processor to the data that resides on
that processor. If this analysis fails, the compiler falls back to a
simple but inefficient scheme called run-time resolution. Each process's
role in the computation is determined by examining the data required for
execution at run-time. Thus, our approach to process decomposition is
data-driven rather than program-driven. We discuss several message
optimizations that address the issues of overhead and synchronization in
message transmission. Accumulation reorganizes the computation of a
commutative and associative operator to reduce message traffic.
Pipelining sends a value as close to its computation as possible to
increase parallelism. Vectorization of messages combines messages with
the same source and the same destination to reduce overhead. Our results
from experiments in parallelizing SIMPLE, a large hydrodynamics
benchmark, for the Intel iPSC/2, show a speedup within 60% to 70% of
handwritten code.
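
As a rough illustration of two ideas named in the abstract, the sketch below simulates run-time resolution and message vectorization for a single loop, A[i] = B[i - shift]. It is not the authors' compiler or its generated code. Several things are assumptions made only for this example: the block data layout, the rule that the processor owning A[i] performs the assignment, and the hypothetical helper names owner, run_time_resolution and vectorize.

```python
from collections import defaultdict


def owner(i, n, p):
    """Processor that holds element i of an n-element array.

    Assumes a block layout over p processors (an assumption of this sketch).
    """
    block = (n + p - 1) // p  # ceiling division: elements per processor
    return i // block


def run_time_resolution(n, p, shift):
    """Simulate A[i] = B[i - shift] for i in shift..n-1.

    Every processor walks the whole iteration space and decides at run time,
    from data ownership alone, whether it has to send an element.
    Returns one (src, dst, index) tuple per single-element message.
    """
    messages = []
    for me in range(p):                   # each simulated processor
        for i in range(shift, n):         # scans every iteration
            src = owner(i - shift, n, p)  # holder of the value that is read
            dst = owner(i, n, p)          # holder of the element written
            if me == src and src != dst:
                messages.append((src, dst, i - shift))
    return messages


def vectorize(messages):
    """Combine messages that share a source and a destination into one batch."""
    batches = defaultdict(list)
    for src, dst, index in messages:
        batches[(src, dst)].append(index)
    return dict(batches)


if __name__ == "__main__":
    element_messages = run_time_resolution(n=16, p=4, shift=2)
    batched = vectorize(element_messages)
    print(len(element_messages), "element messages")
    print(len(batched), "vectorized messages:", batched)
```

In this toy setting, every processor pays the cost of scanning all iterations, which is one way to see why the abstract calls run-time resolution simple but inefficient. Batching by (source, destination) pair leaves the data transferred unchanged and only reduces the number of separate messages.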