J. Skeppstedt and M. Dubois, Compiler controlled prefetching for multiprocessors using low-overhead traps and prefetch engines, Journal of Parallel and Distributed Computing, 60(5), 2000, pp. 585-615
In this paper we propose and evaluate a new data-prefetching technique for
cache coherent multiprocessors. Prefetches are issued by a functional unit
called a prefetch engine which is controlled by the compiler. We let
second-level cache misses generate cache miss traps and start the prefetch engine
in a trap handler. The trap handler is fast (40-50 cycles) and does not
normally delay the program beyond the memory latency of the miss. Once
started, the prefetch engine executes on its own and causes no instruction
overhead. The only instruction overhead in our approach is when a trap
handler completes after data arrives. The advantages of this technique are
(1) it exploits static compiler analysis to determine what to prefetch, which is hard
to do in hardware, (2) it uses prefetching with very little instruction
overhead, which is a limitation for traditional software-controlled
prefetching, and (3) it is accurate in the sense that it generates very
little useless traffic while maintaining a high prefetching coverage. We
also study whether one could emulate the prefetch engine in software, which
would not require any additional hardware beyond support for generating cache miss traps
and ordinary prefetch instructions. In this paper we present the
functionality of the prefetch engine and a compiler algorithm to control it.
We evaluate our technique on six parallel scientific and engineering applications
using an optimizing compiler with our algorithm and a simulated
multiprocessor. We find that the prefetch engine removes up to 67% of the
memory access stall time at an instruction overhead less than 0.42%. The
emulated prefetch engine removes in general less stall time at a higher
instruction overhead. (C) 2000 Academic Press.
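
The miss-triggered mechanism described in the abstract can be illustrated with a toy model. The sketch below is not the paper's prefetch engine or compiler algorithm: it assumes an unbounded cache, a 32-byte block, no timing and no coherence traffic, and the names `simulate`, `stride` and `depth` are hypothetical. It only shows the control flow in which a demand miss acts as the "trap" and a compiler-supplied stride hint tells the engine which blocks to fetch next.

```python
BLOCK_BYTES = 32  # assumed cache block size, not taken from the paper


def simulate(addresses, stride=None, depth=4):
    """Replay byte addresses against an unbounded toy cache.

    stride: hypothetical compiler-supplied hint in bytes; None disables
            the engine, so every first touch of a block is a demand miss.
    depth:  number of strided targets the engine fetches per trap.
    Returns (demand_misses, prefetches_issued).
    """
    cached = set()
    misses = 0
    prefetches = 0
    for addr in addresses:
        block = addr // BLOCK_BYTES
        if block in cached:
            continue  # hit: no trap, no instruction overhead
        misses += 1  # demand miss: this is where the miss trap would fire
        cached.add(block)
        if stride is not None:
            # Stand-in for the trap handler starting the engine: fetch the
            # next `depth` targets along the hinted stride.
            for k in range(1, depth + 1):
                target = (addr + k * stride) // BLOCK_BYTES
                if target not in cached:
                    cached.add(target)
                    prefetches += 1
    return misses, prefetches


# Sequential sweep over 256 eight-byte elements (64 blocks of 32 bytes).
sweep = [i * 8 for i in range(256)]
baseline = simulate(sweep)
with_engine = simulate(sweep, stride=BLOCK_BYTES, depth=4)
print(baseline, with_engine)
```

On this sweep the toy model takes 13 demand misses instead of 64, and one of its 52 prefetches falls past the end of the array. These numbers describe the toy only and say nothing about the results reported in the paper.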