The RISC BLAS: A blocked implementation of Level 3 BLAS for RISC processors

Citation
M.J. Dayde and I.S. Duff, The RISC BLAS: A blocked implementation of Level 3 BLAS for RISC processors, ACM Transactions on Mathematical Software, 25(3), 1999, pp. 316-340
Number of citations
27
Subject Categories
Computer Science & Engineering
Journal title
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
Journal ISSN
0098-3500
Volume
25
Issue
3
Year of publication
1999
Pages
316 - 340
Database
ISI
SICI code
0098-3500(199909)25:3<316:TRBABI>2.0.ZU;2-I
Abstract
We describe a version of the Level 3 BLAS which is designed to be efficient on RISC processors. This is an extension of previous studies by the authors and colleagues on a similar approach for efficient serial and parallel implementations on virtual-memory and shared-memory multiprocessors. All our codes are written in Fortran and use loop-unrolling, blocking, and copying to improve the performance. A blocking technique is used to express the BLAS in terms of operations involving triangular blocks and calls to the matrix-matrix multiplication kernel (GEMM). No manufacturer-supplied or assembler code is used. This blocked implementation uses the same blocking ideas as in our implementation for vector machines except that the ordering of loops is designed for efficient reuse of data held in cache and not necessarily for parallelization. All the codes are specifically tuned for RISC processors. The software also includes a tuned version of GEMM. A parameter which controls the blocking allows efficient exploitation of the memory hierarchy on the various target computers. We present results on a range of RISC-based workstations and multiprocessors: CRAY T3D, DEC 8400 5/300, HP 715/64, IBM SP2, MEIKO CS2-HA, SGI Power Challenge 10000, and SUN UltraSPARC-1 model 140. This implementation of the Level 3 BLAS is available on anonymous FTP, and we welcome input from users to improve and extend our BLAS implementation.
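
The blocking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: their software is written in Fortran, and the function names `gemm_blocked` and `trsm_lower_blocked`, the loop ordering, and the row-list matrix layout below are all assumptions made for illustration. The sketch shows a triangular solve (one of the Level 3 BLAS operations) expressed as small solves with triangular diagonal blocks plus calls to a blocked matrix-matrix multiplication, with a parameter `nb` controlling the block size. The panel copy before each multiplication is only a loose analogue of the copying technique the abstract mentions.

```python
# Illustrative sketch only; names, layout, and loop order are assumptions,
# not taken from the RISC BLAS Fortran sources.

def gemm_blocked(A, B, C, nb):
    """Compute C += A * B block by block; nb is the blocking parameter."""
    m, k, n = len(A), len(B), len(B[0])
    for jj in range(0, n, nb):
        for ll in range(0, k, nb):
            for ii in range(0, m, nb):
                # Work on one nb-by-nb block so the operands stay small.
                for i in range(ii, min(ii + nb, m)):
                    Ci, Ai = C[i], A[i]
                    for l in range(ll, min(ll + nb, k)):
                        a, Bl = Ai[l], B[l]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bl[j]


def trsm_lower_blocked(L, B, nb):
    """Solve L * X = B in place (X overwrites B); L is lower triangular."""
    n, nrhs = len(L), len(B[0])
    for kk in range(0, n, nb):
        kend = min(kk + nb, n)
        # Step 1: forward substitution with the triangular diagonal block.
        for i in range(kk, kend):
            for j in range(nrhs):
                s = B[i][j]
                for p in range(kk, i):
                    s -= L[i][p] * B[p][j]
                B[i][j] = s / L[i][i]
        # Step 2: update the remaining rows with one GEMM call:
        # B[kend:, :] -= L[kend:, kk:kend] * X[kk:kend, :]
        if kend < n:
            # Copy the (negated) off-diagonal panel into its own array.
            neg_panel = [[-L[i][p] for p in range(kk, kend)]
                         for i in range(kend, n)]
            x_block = [row[:] for row in B[kk:kend]]
            tail = B[kend:]  # same row objects, so the update is in place
            gemm_blocked(neg_panel, x_block, tail, nb)
```

Calling `trsm_lower_blocked(L, B, nb)` with a nonsingular lower triangular `L` overwrites `B` with the solution. In this sketch, changing `nb` alters only the order of the arithmetic, not the mathematical result, which is what lets a single parameter be tuned per machine.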