M. Kandemir et al., Compiler algorithms for optimizing locality and parallelism on shared and distributed-memory machines, Journal of Parallel and Distributed Computing, 60(8), 2000, pp. 924-965
Distributed-memory message-passing machines deliver scalable performance but are difficult to program. Shared-memory machines, on the other hand, are easier to program, but obtaining scalable performance with a large number of processors is difficult. Recently, scalable machines based on logically shared, physically distributed memory have been designed and implemented. While some performance issues, such as parallelism and locality, are common to different parallel architectures, issues such as data distribution are unique to specific architectures. One of the most important challenges compiler writers face is the design of compilation techniques that work well on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests so that, where possible, the iterations of the outermost loops can be run in parallel across processors; (2) optimizes memory locality by carefully distributing each array across processors; (3) optimizes interprocessor communication using message vectorization whenever possible; and (4) optimizes cache locality by assigning an appropriate storage layout to each array and by transforming the iteration space. Depending on the machine architecture, some or all of these steps can be applied in a unified framework. We report empirical results on an SGI Origin 2000 distributed-shared-memory multiprocessor and an IBM SP-2 distributed-memory message-passing machine to validate the effectiveness of our approach. (C) 2000 Academic Press.
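Step (1) of the abstract, running the outermost loop iterations in parallel, is commonly achieved by loop permutation: when the original nest carries a dependence on its outer loop but not on an inner one, interchanging the loops lets the new outer loop be partitioned across processors. The sketch below is an illustrative example of that transformation only, not code from the paper; the array size N and the OpenMP annotation are assumptions added for concreteness.

/* Illustrative sketch (not from the paper): loop permutation to expose
 * outer-loop parallelism, in the spirit of step (1) of the abstract.
 *
 * Original nest: the i-loop carries a flow dependence (a[i][j] reads
 * a[i-1][j]), so its iterations cannot run in parallel:
 *
 *   for (i = 1; i < N; i++)
 *     for (j = 0; j < N; j++)
 *       a[i][j] = a[i-1][j] + b[i][j];
 *
 * After interchanging i and j, the outer j-loop carries no dependence,
 * so its iterations can be distributed across processors.
 */
#include <stdio.h>
#define N 1024                          /* assumed problem size */

static double a[N][N], b[N][N];

int main(void)
{
    /* Transformed nest: j is now outermost and fully parallel. */
    #pragma omp parallel for            /* each processor takes a block of columns */
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)     /* the dependence is now carried by the inner loop */
            a[i][j] = a[i-1][j] + b[i][j];

    printf("%f\n", a[N - 1][0]);
    return 0;
}

Under such a column-wise partitioning, step (4) of the abstract would presumably favor a column-major storage layout for a, so that each processor's block of columns is contiguous in memory, and message vectorization (step 3) would hoist any boundary communication out of the inner loop so it can be sent as one block transfer per neighbor rather than many element-sized messages.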