T.C. Mowry and C.K. Luk, Understanding why correlation profiling improves the predictability of data cache misses in nonnumeric applications, IEEE Transactions on Computers, 49(4), 2000, pp. 369-384
Latency-tolerance techniques offer the potential for bridging the
ever-increasing speed gap between the memory subsystem and today's
high-performance processors. However, to fully exploit the benefit of these
techniques, one must be careful to apply them only to the dynamic references
that are likely to suffer cache misses; otherwise the runtime overheads can
potentially offset any gains. In this paper, we focus on isolating dynamic
miss instances in nonnumeric applications, which is a difficult but important
problem. Although compilers cannot statically analyze data locality in
nonnumeric applications, one viable approach is to use profiling information
to measure the actual miss behavior. Unfortunately, the state-of-the-art in
cache miss profiling (which we call summary profiling) is inadequate for
references with intermediate miss ratios: it either misses opportunities to
hide latency, or else inserts overhead that is unnecessary. To overcome this
problem, we propose and evaluate a new profiling technique that helps predict
which dynamic instances of a static memory reference will hit or miss in the
cache: correlation profiling. Our experimental results demonstrate that
roughly half of the 21 nonnumeric applications we study can potentially enjoy
significant reductions in memory stall time by exploiting at least one of the
three forms of correlation profiling we consider: control-flow correlation,
self correlation, and global correlation. In addition, our detailed case
studies illustrate that self correlation succeeds because a given reference's
cache outcomes often contain repeated patterns, and control-flow correlation
succeeds because cache outcomes are often call-chain dependent. Finally, we
suggest a number of ways to exploit correlation profiling in practice and
demonstrate that software prefetching can achieve better performance on a
modern superscalar processor when directed by correlation profiling rather
than summary profiling information.
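
The contrast between summary profiling and two of the correlation forms named in the abstract can be illustrated with a small Python sketch. This is not the authors' tooling. The trace format (static reference, call chain, miss flag), the function names, and the synthetic trace are all assumptions made for illustration, and global correlation is left out. The sketch only shows how a reference with an intermediate summary miss ratio can become predictable once outcomes are grouped by its own recent history or by its call chain.

```python
from collections import defaultdict


def _ratios(counts):
    """Turn {key: [misses, total]} into {key: miss ratio}."""
    return {key: misses / total for key, (misses, total) in counts.items()}


def summary_profile(trace):
    """One miss ratio per static reference (the summary-profiling baseline)."""
    counts = defaultdict(lambda: [0, 0])
    for ref, _chain, miss in trace:
        counts[ref][0] += int(miss)
        counts[ref][1] += 1
    return _ratios(counts)


def self_correlation_profile(trace, depth=1):
    """Miss ratio per (reference, its own last `depth` cache outcomes)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    history = defaultdict(tuple)          # reference -> recent outcomes
    counts = defaultdict(lambda: [0, 0])
    for ref, _chain, miss in trace:
        past = history[ref]
        if len(past) == depth:            # skip until enough history exists
            counts[(ref, past)][0] += int(miss)
            counts[(ref, past)][1] += 1
        history[ref] = (past + (miss,))[-depth:]
    return _ratios(counts)


def control_flow_profile(trace):
    """Miss ratio per (reference, call chain active at the access)."""
    counts = defaultdict(lambda: [0, 0])
    for ref, chain, miss in trace:
        counts[(ref, chain)][0] += int(miss)
        counts[(ref, chain)][1] += 1
    return _ratios(counts)


# Hypothetical trace: "ld_a" alternates miss/hit; "ld_b" misses only when
# reached through the (made-up) caller "build".
callers = ["build", "build", "lookup", "build",
           "lookup", "lookup", "build", "lookup"]
trace = []
for i, caller in enumerate(callers):
    trace.append(("ld_a", ("main", "scan"), i % 2 == 0))
    trace.append(("ld_b", ("main", caller), caller == "build"))

summary = summary_profile(trace)
self_corr = self_correlation_profile(trace, depth=1)
by_chain = control_flow_profile(trace)
```

On this synthetic trace both references have a summary miss ratio of 0.5, which gives no guidance on whether to prefetch. Grouped by previous outcome, "ld_a" always misses after a hit and always hits after a miss; grouped by call chain, "ld_b" always misses under "build" and never under "lookup". Real traces would be far noisier, so the conditional ratios would rarely be exactly 0 or 1.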