A three-dimensional (3D) matrix multiplication algorithm for massively
parallel processing systems is presented. The P processors are config
ured as a ''virtual'' processing cube with dimensions p(1), p(2), and
p(3) proportional to the matrices' dimensions-M, N,and K. Each process
or performs a single local matrix multiplication of size M/p(1) x N/p(
2) x K/p(3). Before the local computation can be carried out, each sub
cube must receive a single submatrix of A and B, After the single matr
ix multiplication has completed, K/p(3), submatrices of this product m
ust be sent to their respective destination processors and then summed
together with the resulting matrix C. The 3D parallel matrix multipli
cation approach has a factor of P-1/6 less communication than the 2D p
arallel algorithms. This algorithm has been implemented on IBM POWERpa
rallel(TM) SP2(TM) systems (up to 216 nodes) and has yielded close to
the peak performance of the machine. The algorithm has been combined w
ith Winograd's variant of Strassen's algorithm to achieve performance
which exceeds the theoretical peak of the system. (We assume the MFLOP
S rate of matrix multiplication to be 2MNK.)