Emmerald: a fast matrix-matrix multiply using Intel's SSE instructions

Citation
D. Aberdeen and J. Baxter, Emmerald: a fast matrix-matrix multiply using Intel's SSE instructions, CONCURR COM, 13(2), 2001, pp. 103-119
Number of citations
15
Subject Categories
Computer Science & Engineering
Journal title
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
ISSN journal
1532-0626
Volume
13
Issue
2
Year of publication
2001
Pages
103 - 119
Database
ISI
SICI code
1532-0626(200102)13:2<103:EAFMMU>2.0.ZU;2-L
Abstract
Generalized matrix-matrix multiplication forms the kernel of many mathematical algorithms, hence a faster matrix-matrix multiply immediately benefits these algorithms. In this paper we implement efficient matrix multiplication for large matrices using the Intel Pentium single instruction multiple data (SIMD) floating point architecture. The main difficulty with the Pentium and other commodity processors is the need to efficiently utilize the cache hierarchy, particularly given the growing gap between main-memory and CPU clock speeds. We give a detailed description of the register allocation, Level 1 and Level 2 cache blocking strategies that yield the best performance for the Pentium III family. Our results demonstrate an average performance of 2.09 times faster than the leading public domain matrix-matrix multiply routines and comparable performance with Intel's SIMD small matrix-matrix multiply routines. Copyright (C) 2001 John Wiley & Sons, Ltd.
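
Illustrative sketch
The abstract names cache blocking as the central technique but gives no code. The following is a minimal sketch of generic single-level cache blocking in portable C, not the Emmerald implementation: the function names, the row-major square layout, and the tile size BLOCK = 32 are all assumptions made for illustration. It uses no SSE intrinsics and no register-level or two-level tuning. The contiguous inner loop over j is merely the kind of loop a compiler may vectorize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; the paper's tuned block sizes are not given
   in the abstract, so this value is only a placeholder. */
#define BLOCK 32

/* Reference triple loop: C = A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Cache-blocked version: the same product, computed tile by tile so
   that each BLOCK x BLOCK piece of A, B and C is reused while it is
   likely still resident in cache. */
void matmul_blocked(size_t n, const float *A, const float *B, float *C)
{
    assert(A != C && B != C);   /* output must not alias the inputs */

    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_size(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = min_size(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_size(jj + BLOCK, n);
                /* Multiply one tile of A by one tile of B and
                   accumulate into the matching tile of C. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        /* Contiguous accesses to B and C. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product and differ only in traversal order. Floating point rounding can therefore differ slightly between them on general inputs, and any speedup depends on matrix size, tile size, and the cache sizes of the target machine.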