Compiler controlled prefetching for multiprocessors using low-overhead traps and prefetch engines

Citation
J. Skeppstedt and M. Dubois, Compiler controlled prefetching for multiprocessors using low-overhead traps and prefetch engines, J PAR DISTR, 60(5), 2000, pp. 585-615
Citations number
22
Subject Categories
Computer Science & Engineering
Journal title
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
ISSN journal
0743-7315
Volume
60
Issue
5
Year of publication
2000
Pages
585 - 615
Database
ISI
SICI code
0743-7315(200005)60:5<585:CCPFMU>2.0.ZU;2-S
Abstract
In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40-50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine generally removes less stall time at a higher instruction overhead. (C) 2000 Academic Press.
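Illustration
The software-emulated variant mentioned in the abstract can be pictured with a minimal C sketch. The handler name, descriptor layout, and lookup routine below are illustrative assumptions, not the paper's interface; only the overall idea follows the abstract: a low-overhead cache-miss trap handler consults a compiler-generated stride descriptor for the faulting reference and issues ordinary prefetch instructions for the blocks expected next.

/*
 * Minimal sketch (not the paper's implementation) of a software-emulated
 * prefetch engine: a second-level cache-miss trap handler walks a
 * compiler-generated stride descriptor and issues ordinary prefetches.
 */
#include <stddef.h>
#include <stdint.h>

/* Per-reference prefetch descriptor filled in by the compiler (assumed layout). */
struct prefetch_desc {
    ptrdiff_t stride;   /* byte stride between successive accesses */
    int       degree;   /* how many blocks ahead to prefetch       */
};

/* Hypothetical lookup from the faulting PC to its descriptor. */
extern const struct prefetch_desc *lookup_desc(uintptr_t miss_pc);

/* Invoked by the (assumed) low-overhead cache-miss trap. */
void cache_miss_trap_handler(uintptr_t miss_pc, void *miss_addr)
{
    const struct prefetch_desc *d = lookup_desc(miss_pc);
    if (d == NULL)
        return;                  /* no compiler hint for this load */

    char *p = (char *)miss_addr;
    for (int i = 1; i <= d->degree; i++) {
        /* Ordinary read prefetch; 0 = read access, 1 = low temporal locality. */
        __builtin_prefetch(p + i * d->stride, 0, 1);
    }
}

In the hardware scheme the paper actually proposes, the same descriptor information would instead be loaded into the prefetch engine by the trap handler, after which the engine issues the prefetches autonomously with no further instruction overhead.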