Exploiting locality of reference is key to realizing high levels of perform
ance on modern processors. This paper describes a compiler algorithm for op
timizing cache locality in scientific codes on uniprocessor and multiproces
sor machines. A distinctive characteristic of our algorithm is that it cons
iders loop and data layout transformations in a unified framework. Our appr
oach is very effective at reducing cache misses and can optimize some nests
for which optimization techniques based on loop transformations alone are
not successful. An important special case is one in which data layouts of s
ome arrays are fixed and cannot be changed. We show how our algorithm can a
ccommodate this case and demonstrate how it can be used to optimize multipl
e loop nests. Experiments on several benchmarks show that the techniques pr
esented in this paper result in substantial improvement in cache performanc
e.