We describe a version of the Level 3 BLAS which is designed to be efficient
on RISC processors. This is an extension of previous studies by the authors
and colleagues on a similar approach for efficient serial and parallel
implementations on virtual-memory and shared-memory multiprocessors. All our
codes are written in Fortran and use loop-unrolling, blocking, and copying
to improve the performance. A blocking technique is used to express the BLAS
in terms of operations involving triangular blocks and calls to the
matrix-matrix multiplication kernel (GEMM). No manufacturer-supplied or
assembler code is used. This blocked implementation uses the same blocking ideas as
in our implementation for vector machines except that the ordering of loops
is designed for efficient reuse of data held in cache and not necessarily
for parallelization. All the codes are specifically tuned for RISC
processors. The software also includes a tuned version of GEMM. A parameter which
controls the blocking allows efficient exploitation of the memory hierarchy
on the various target computers. We present results on a range of
RISC-based workstations and multiprocessors: CRAY T3D, DEC 8400 5/300,
HP 715/64, IBM SP2, MEIKO CS2-HA, SGI Power Challenge 10000, and SUN
UltraSPARC-1 model 140. This implementation of the Level 3 BLAS is available
via anonymous FTP, and we welcome input from users to improve and extend our
BLAS implementation.
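
The blocking idea described above can be illustrated with a minimal sketch. It assumes a triangular matrix product C = L*B with L lower triangular, split into block rows of size nb: each block row needs one small triangular-block operation on the diagonal block plus one GEMM-style update for everything to its left. The function names (gemm_acc, tri_block_acc, blocked_trmm) and the parameter nb are hypothetical and chosen only for illustration. The sketch uses plain Python rather than the authors' Fortran, and it does not show the loop-unrolling, copying, or tuned GEMM kernel mentioned in the abstract.

```python
def gemm_acc(a, b, c, i0, i1, k0, k1):
    """C[i0:i1, :] += A[i0:i1, k0:k1] * B[k0:k1, :].

    Stand-in for a tuned GEMM kernel (hypothetical helper).
    """
    ncols = len(b[0])
    for i in range(i0, i1):
        for k in range(k0, k1):
            aik = a[i][k]
            for j in range(ncols):
                c[i][j] += aik * b[k][j]


def tri_block_acc(l, b, c, d0, d1):
    """C[d0:d1, :] += tril(L[d0:d1, d0:d1]) * B[d0:d1, :].

    Only the lower triangle of the diagonal block is read.
    """
    ncols = len(b[0])
    for i in range(d0, d1):
        for k in range(d0, i + 1):
            lik = l[i][k]
            for j in range(ncols):
                c[i][j] += lik * b[k][j]


def blocked_trmm(l, b, nb):
    """Return C = L * B for lower-triangular L, using block rows of size nb."""
    n = len(l)
    m = len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for d0 in range(0, n, nb):
        d1 = min(d0 + nb, n)
        if d0 > 0:
            # Rectangular part to the left of the diagonal block: one GEMM call.
            gemm_acc(l, b, c, d0, d1, 0, d0)
        # Diagonal block: a small triangular operation.
        tri_block_acc(l, b, c, d0, d1)
    return c


def reference_trmm(l, b):
    """Unblocked reference product, used only to check the blocked version."""
    n = len(l)
    m = len(b[0])
    return [[sum(l[i][k] * b[k][j] for k in range(i + 1)) for j in range(m)]
            for i in range(n)]
```

In this sketch nb plays the role of the blocking parameter: as nb grows, more of the work moves from the GEMM update into the triangular-block routine, and the strictly upper part of L is never read for any value of nb. A tuned implementation would pick nb so that the blocks in use fit in cache.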