The block-cyclic data distribution is commonly used to organize array eleme
nts over the processors of a coarse-grained distributed memory parallel com
puter. In many scientific applications, the data layout must be reorganized
at run-time in order to enhance locality and reduce remote memory access o
verheads. In this paper we present a general framework for developing array
redistribution algorithms. Using this framework, we have developed efficie
nt algorithms that redistribute an array from one block-cyclic layout to an
other.
Block-cyclic redistribution consists of index set computation, wherein the
destination locations for individual data blocks are calculated, and data c
ommunication, wherein these blocks are exchanged between processors. The fr
amework treats both these operations in a uniform and integrated way. We ha
ve developed efficient and distributed algorithms for index set computation
that do not require any interprocessor communication. To perform data comm
unication in a conflict-free manner, we have developed direct, indirect, an
d hybrid algorithms. In the direct algorithm, a data block is transferred d
irectly to its destination processor. In an indirect algorithm, data blocks
are moved from source to destination processors through intermediate relay
processors. The hybrid algorithm is a combination of the direct and indire
ct algorithms.
Our framework is based on a generalized circulant matrix formalism of the r
edistribution problem and a general purpose distributed memory model of the
parallel machine. Our algorithms sustain excellent performance over a wide
range of problem and machine parameters. We have implemented our algorithm
s using MPI, to allow for easy portability across different HPC platforms.
Experimental results on the IBM SP-2 and the Gray T3D show superior perform
ance over previous approaches. When the block size of the cyclic data layou
t changes by a factor of K, the redistribution can be performed in O(log K)
communication steps. This is true even when K is a prime number. In contra
st, previous approaches take O(K) communication steps for redistribution.
Our framework can be used for developing scalable redistribution libraries,
for efficiently implementing parallelizing compiler directives, and for de
veloping parallel algorithms for various applications. Redistribution algor
ithms are especially useful in signal processing applications, where the da
ta access patterns change significantly between computational phases. They
are also necessary in linear algebra programs, to perform matrix transpose
operations.