Ks. Mckinley et al., IMPROVING DATA LOCALITY WITH LOOP TRANSFORMATIONS, ACM transactions on programming languages and systems, 18(4), 1996, pp. 424-453
In the past decade, processor speed has become significantly faster th
an memory speed. Small, fast cache memories are designed to overcome t
his discrepancy, but they are only effective when programs exhibit dat
a locality. In this article, we present compiler optimizations to impr
ove data locality based on a simple yet accurate cost model. The model
computes both temporal and spatial reuse of cache lines to find desir
able loop organizations. The cost model drives the application of comp
ound transformations consisting of loop permutation, loop fusion, loop
distribution, and loop reversal. We demonstrate that these program tr
ansformations are useful for optimizing many programs. To validate our
optimization strategy, we implemented our algorithms and ran experime
nts on a large collection of scientific programs and kernels. Experime
nts illustrate that for kernels our model and algorithm can select and
achieve the best loop structure for a nest. For over 30 complete appl
ications, we executed the original and transformed versions and simula
ted cache hit rates. We collected statistics about the inherent charac
teristics of these programs and our ability to improve their data loca
lity. To our knowledge, these studies are the first of such breadth an
d depth. We found performance improvements were difficult to achieve b
ecause benchmark programs typically have high hit rates even for small
data caches; however, our optimizations significantly improved severa
l programs.