In this paper we present an efficient dense matrix multiplication
algorithm for distributed memory computers with a hypercube topology.
The proposed algorithm performs better than all previously proposed
algorithms for a wide range of matrix sizes and number of processors,
especially for large matrices. We analyze the performance of the
algorithms for two types of hypercube architectures, one in which each
node can use (to send and receive) at most one communication link at a
time and the other in which each node can use all communication links
simultaneously.