In this paper, we study the implementation of dense linear algebra kernels,
such as matrix multiplication or linear system solvers, on heterogeneous n
etworks of workstations. The uniform block-cyclic data distribution scheme
commonly used for homogeneous collections of processors limits the performa
nce of these linear algebra kernels on heterogeneous grids to the speed of
the slowest processor. We present and study more sophisticated data allocat
ion strategies that balance the load on heterogeneous platforms with respec
t to the performance of the processors. When targeting unidimensional grids
, the load-balancing problem can be solved rather easily. When targeting tw
o-dimensional grids, which are the key to scalability and efficiency for nu
merical kernels, the problem turns out to be surprisingly difficult. We for
mally state the 2D load-balancing problem and prove its NP-completeness. Ne
xt, we introduce a data allocation heuristic, which turns out to be very sa
tisfactory: Its practical usefulness is demonstrated by MPI experiments con
ducted with a heterogeneous network of workstations.