In scalable parallel machines, processors can make local memory accesses much faster than remote memory accesses. Additionally, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this article, we start with a language like HPF-Fortran with user-specified data distribution and develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformations using invertible matrices and integer lattice theory.