AUTO-BLOCKING MATRIX-MULTIPLICATION OR TRACKING BLAS3 PERFORMANCE FROM SOURCE CODE

Authors
Citation
Jd. Frens et Ds. Wise, AUTO-BLOCKING MATRIX-MULTIPLICATION OR TRACKING BLAS3 PERFORMANCE FROM SOURCE CODE, ACM SIGPLAN NOTICES, 32(7), 1997, pp. 206-216
Citations number
24
Categorie Soggetti
Computer Sciences","Computer Science Software Graphycs Programming
Journal title
Volume
32
Issue
7
Year of publication
1997
Pages
206 - 216
Database
ISI
SICI code
Abstract
An elementary, machine-independent, recursive algorithm for matrix mul tiplication C+=AB provides implicit blocking at every level of the me mory hierarchy and tests out faster than classically optimal code, tra cking hand-coded BLAS3 routines. Proof of concept is demonstrated by r acing the in-place algorithm against manufacturer's hand-tuned BLAS3 r outines; it can win. The recursive code bifurcates naturally at the to p level into independent block-oriented processes, that each writes to a disjoint and contigous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all leve ls of the memory hierarchy, independently of the sizes of caches or pa ges and without ad hoc programming. It also exposed a weakness in SGI' s C compilers that merrily unroll loops for the superscalar R8000 proc essor, but do not analogously unfold the base cases of the most elemen tary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.