The RISC BLAS: A blocked implementation of Level 3 BLAS for RISC processors

Authors

Dayde, MJ; Duff, IS

Citation

M.J. Dayde and I.S. Duff, The RISC BLAS: A blocked implementation of Level 3 BLAS for RISC processors, ACM T MATH, 25(3), 1999, pp. 316-340

Citations number

Subject Categories

Computer Science & Engineering

Journal title

ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE

ISSN journal

0098-3500

Volume

25

Issue

3

Year of publication

1999

Pages

316 - 340

Database

ISI

SICI code

0098-3500(199909)25:3<316:TRBABI>2.0.ZU;2-I

Abstract

We describe a version of the Level 3 BLAS which is designed to be efficient on RISC processors. This is an extension of previous studies by the authors and colleagues on a similar approach for efficient serial and parallel implementations on virtual-memory and shared-memory multiprocessors. All our codes are written in Fortran and use loop-unrolling, blocking, and copying to improve the performance. A blocking technique is used to express the BLAS in terms of operations involving triangular blocks and calls to the matrix-matrix multiplication kernel (GEMM). No manufacturer-supplied or assembler code is used. This blocked implementation uses the same blocking ideas as in our implementation for vector machines except that the ordering of loops is designed for efficient reuse of data held in cache and not necessarily for parallelization. All the codes are specifically tuned for RISC processors. The software also includes a tuned version of GEMM. A parameter which controls the blocking allows efficient exploitation of the memory hierarchy on the various target computers. We present results on a range of RISC-based workstations and multiprocessors: CRAY T3D, DEC 8400 5/300, HP 715/64, IBM SP2, MEIKO CS2-HA, SGI Power Challenge 10000, and SUN UltraSPARC-1 model 140. This implementation of the Level 3 BLAS is available on anonymous FTP, and we welcome input from users to improve and extend our BLAS implementation.
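The blocking idea in the abstract, a Level 3 BLAS operation expressed as small triangular-block operations plus GEMM calls, can be illustrated with a minimal sketch. This is not the authors' Fortran code: it is a plain Python illustration of a blocked lower-triangular solve (the TRSM operation), and the names `gemm`, `trsm_unblocked`, `block`, `blocked_trsm`, and the block-size argument `nb` are all hypothetical. Copying each block into its own small array loosely echoes the copying mentioned in the abstract, but the loop ordering and tuning of the real library are not reproduced here.

```python
def gemm(alpha, A, B, beta, C):
    """C <- alpha*A*B + beta*C, in place, for dense lists of lists."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            s = 0.0
            for p in range(len(B)):
                s += A[i][p] * B[p][j]
            C[i][j] = alpha * s + beta * C[i][j]


def trsm_unblocked(L, B):
    """Solve L*X = B in place (X overwrites B); L is a small lower-triangular block."""
    for i in range(len(L)):
        for j in range(len(B[0])):
            s = B[i][j]
            for p in range(i):
                s -= L[i][p] * B[p][j]
            B[i][j] = s / L[i][i]


def block(M, r0, r1, c0, c1):
    """Copy the sub-block M[r0:r1, c0:c1] into a new list of lists."""
    return [row[c0:c1] for row in M[r0:r1]]


def blocked_trsm(L, B, nb):
    """Solve L*X = B for lower-triangular L, using blocks of order nb.

    Each step solves with one triangular diagonal block, then updates
    all remaining rows of the right-hand side with a single GEMM call.
    """
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]  # work on a copy of the right-hand side
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Triangular-block operation on the diagonal block.
        Xk = block(X, k0, k1, 0, m)
        trsm_unblocked(block(L, k0, k1, k0, k1), Xk)
        X[k0:k1] = Xk
        if k1 < n:
            # GEMM update: X[k1:, :] -= L[k1:, k0:k1] * X[k0:k1, :]
            Xrest = block(X, k1, n, 0, m)
            gemm(-1.0, block(L, k1, n, k0, k1), Xk, 1.0, Xrest)
            X[k1:n] = Xrest
    return X


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
    B = [[2.0, 4.0], [10.0, 14.0], [49.0, 64.0]]
    print(blocked_trsm(L, B, nb=2))
```

In this sketch `nb` plays the role of the blocking parameter described in the abstract: a larger value moves more of the work into the GEMM calls, while a smaller value keeps each block small enough to stay in cache. The best value depends on the target machine.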