Mj. Dayde et al., A PARALLEL BLOCK IMPLEMENTATION OF LEVEL-3 BLAS FOR MIMD VECTOR PROCESSORS, ACM transactions on mathematical software, 20(2), 1994, pp. 178-193
We describe an implementation of Level-3 BLAS (Basic Linear Algebra
Subprograms) based on the use of the matrix-matrix multiplication
kernel (GEMM). Blocking techniques are used to express the BLAS in
terms of operations involving triangular blocks and calls to GEMM. A
principal advantage of this approach is that most manufacturers provide
at least an efficient serial version of GEMM so that our implementation
can capture a significant percentage of the computer performance. A
parameter which controls the blocking allows an efficient exploitation
of the memory hierarchy of the various target computers. Furthermore,
this blocked version of Level-3 BLAS is naturally parallel. We present
results on the ALLIANT FX/80, the CONVEX C220, the CRAY-2, and the IBM
3090/VF. For GEMM, we always use the manufacturer-supplied versions.
For the operations dealing with triangular blocks, we use assembler or
tuned Fortran (using loop-unrolling) codes, depending on the efficiency
of the available libraries.
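
As a rough illustration of the blocking idea (not the authors' code), the sketch below expresses a lower-triangular solve L X = B, the TRSM operation of Level-3 BLAS, as small triangular solves on diagonal blocks followed by GEMM updates of the remaining rows. The function names, the pure-Python list-of-lists storage, and the choice of TRSM as the example are assumptions made for illustration. In the setting described above, `gemm_update` would be replaced by the manufacturer-supplied GEMM, and `nb` plays the role of the blocking parameter.

```python
def gemm_update(C, A, B, alpha=1.0):
    """C += alpha * A @ B on lists of rows (stand-in for a vendor GEMM)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            s = 0.0
            for k in range(len(B)):
                s += A[i][k] * B[k][j]
            C[i][j] += alpha * s


def trsm_diag_block(L, B, k0, k1):
    """Solve L[k0:k1, k0:k1] X = B[k0:k1, :] in place (forward substitution).

    This is the small triangular kernel; it touches only one diagonal block.
    """
    ncols = len(B[0])
    for i in range(k0, k1):
        for j in range(ncols):
            s = B[i][j]
            for p in range(k0, i):
                s -= L[i][p] * B[p][j]
            B[i][j] = s / L[i][i]


def blocked_trsm_lower(L, B, nb):
    """Overwrite B with the solution X of L X = B, L lower triangular.

    nb is the (hypothetical) blocking parameter: the order of each
    diagonal block handled by the triangular kernel.
    """
    n = len(L)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Triangular block: solve for rows k0..k1-1 of X.
        trsm_diag_block(L, B, k0, k1)
        if k1 < n:
            # GEMM: B[k1:, :] -= L[k1:, k0:k1] @ X[k0:k1, :].
            # B[k1:] shares its row objects with B, so the update is in place.
            panel = [row[k0:k1] for row in L[k1:]]
            gemm_update(B[k1:], panel, B[k0:k1], alpha=-1.0)
    return B


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
    B = [[2.0, 4.0], [10.0, 14.0], [49.0, 64.0]]
    print(blocked_trsm_lower(L, B, nb=2))
```

In this sketch, a larger `nb` shifts work from the GEMM update into the triangular kernel, and a smaller `nb` does the opposite. Choosing `nb` to suit a given memory hierarchy is the tuning decision the abstract refers to.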