S. Toledo, Improving the Memory-System Performance of Sparse-Matrix Vector Multiplication, IBM Journal of Research and Development, 41(6), 1997, pp. 711-725
Sparse-matrix vector multiplication is an important kernel that often
runs inefficiently on superscalar RISC processors. This paper describes
techniques that increase instruction-level parallelism and improve
performance. The techniques include reordering to reduce cache misses
(originally due to Das et al.), blocking to reduce load instructions,
and prefetching to prevent multiple load-store units from stalling
simultaneously. The techniques improve performance from about 40 MFLOPS
(on a well-ordered matrix) to more than 100 MFLOPS on a 266-MFLOPS
machine. The techniques are applicable to other superscalar RISC
processors as well, and have improved performance on a Sun
UltraSPARC(TM) I workstation, for example.
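
As a rough illustration of how blocking can reduce load instructions, the sketch below compares a plain compressed-sparse-row (CSR) product with a variant that stores 1x2 blocks, so that a single column index serves two adjacent matrix entries. This is a minimal sketch under stated assumptions, not the paper's implementation: the 1x2 block shape, the function names csr_spmv and bcsr_1x2_spmv, and the index_loads counter are all hypothetical and chosen only for illustration. Reordering and prefetching are not shown.

```python
# Illustrative sketch only. The function names, the 1x2 block shape, and the
# index_loads counter are assumptions for this example, not taken from the paper.

def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A*x with A in CSR form.

    Returns (y, index_loads), where index_loads counts reads of col_idx.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    index_loads = 0
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]          # one index load per nonzero
            index_loads += 1
            acc += vals[k] * x[j]
        y[i] = acc
    return y, index_loads


def bcsr_1x2_spmv(row_ptr, blk_col, blk_vals, x):
    """Compute y = A*x with A stored as 1x2 blocks.

    Each block holds the entries at columns j and j+1 of one row and
    shares the single column index j, so index loads are halved.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    index_loads = 0
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = blk_col[k]          # one index load per two entries
            index_loads += 1
            a0, a1 = blk_vals[k]
            acc += a0 * x[j] + a1 * x[j + 1]
        y[i] = acc
    return y, index_loads


# Example: the 3x4 matrix
#   [1 2 0 0]
#   [0 0 3 4]
#   [5 6 7 8]
x = [1.0, 2.0, 3.0, 4.0]
y_csr, loads_csr = csr_spmv(
    [0, 2, 4, 8], [0, 1, 2, 3, 0, 1, 2, 3],
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], x)
y_blk, loads_blk = bcsr_1x2_spmv(
    [0, 1, 2, 4], [0, 2, 0, 2],
    [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)], x)
print(y_csr, loads_csr)   # [5.0, 25.0, 70.0] 8
print(y_blk, loads_blk)   # [5.0, 25.0, 70.0] 4
```

Both routines produce the same result, but the blocked version reads half as many column indices on this matrix. Blocking pays off only when nonzeros cluster in adjacent positions; otherwise the blocks must be padded with explicit zeros, which adds arithmetic and storage.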