Generalized matrix-matrix multiplication forms the kernel of many
mathematical algorithms, so a faster matrix-matrix multiply immediately
benefits these algorithms. In this paper we implement efficient matrix
multiplication for large matrices using the Intel Pentium single
instruction multiple data (SIMD) floating point architecture. The main
difficulty with the Pentium and other commodity processors is the need to
utilize the cache hierarchy efficiently, particularly given the growing gap
between main-memory and CPU clock speeds. We give a detailed description of
the register allocation and the Level 1 and Level 2 cache blocking
strategies that yield the best performance for the Pentium III family. Our
results demonstrate performance that is on average 2.09 times faster than
the leading public domain matrix-matrix multiply routines and comparable
to that of Intel's SIMD small matrix-matrix multiply routines.
Copyright (C) 2001 John Wiley & Sons, Ltd.
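
As a rough illustration of what cache blocking means here, the following C sketch multiplies two square row-major matrices one tile at a time, so that each tile is reused while it is still in cache. It is not the implementation described in this paper: it uses a single level of blocking, plain scalar code rather than SIMD instructions, and no explicit register allocation. The function name `matmul_blocked` and the tile size `BLOCK` are assumptions made for this example, and a suitable tile size depends on the cache of the target processor.

```c
#include <stddef.h>

/* Hypothetical tile size, chosen only for illustration. In practice it is
 * tuned so that the tiles being worked on fit in the targeted cache level. */
#define BLOCK 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C += A * B for n x n matrices stored in row-major order.
 * The three outer loops walk over tiles; the three inner loops perform an
 * ordinary multiply restricted to one tile of A, one of B and one of C. */
void matmul_blocked(size_t n, const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_size(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = min_size(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_size(jj + BLOCK, n);
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        /* Keep one element of A in a local variable so the
                         * innermost loop streams over rows of B and C. */
                        float a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The caller is expected to zero `C` first if a plain product is wanted, since the routine accumulates into it. A second level of blocking would wrap these loops in a further set of tile loops sized for the larger cache.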