Architectural and compiler support for effective instruction prefetching: A cooperative approach

Authors
C.K. Luk, T.C. Mowry
Citation
C.K. Luk and T.C. Mowry, Architectural and compiler support for effective instruction prefetching: A cooperative approach, ACM T COMPUT SYST, 19(1), 2001, pp. 71-109
Number of citations
29
Subject categories
Computer Science & Engineering
Journal title
ACM TRANSACTIONS ON COMPUTER SYSTEMS
Journal ISSN
0734-2071
Volume
19
Issue
1
Year of publication
2001
Pages
71 - 109
Database
ISI
SICI code
0734-2071(200102)19:1<71:AACSFE>2.0.ZU;2-Z
Abstract
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors, since they fail to issue prefetches early enough (particularly for nonsequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of nonsequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach hides 50% or more of the latency remaining with the best previous techniques, while at the same time reducing the number of useless prefetches by a factor of six. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, and that the compiler can limit the code expansion to only 9% on average. In addition, we show that the performance of our technique can be further increased by using profiling information to help reduce cache conflicts and unnecessary prefetches. From an architectural perspective, these performance advantages are sustained over a range of common miss latencies and bandwidths. Finally, our technique is cost-effective as well, since it delivers performance comparable to (or even better than) that of larger caches, but requires a much smaller hardware budget.
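
To make the compiler component of the abstract concrete, the following is a minimal illustrative sketch, not the authors' actual algorithm: a pass that inserts an instruction prefetch for each nonsequential control-transfer target a fixed number of instructions ahead of the transfer. The Instr and Block types, the PREFETCH_DISTANCE constant, and the function name insert_instruction_prefetches are all hypothetical names chosen for this example.

    from dataclasses import dataclass, field

    # Assumed prefetch distance in instructions: roughly miss latency
    # divided by the processor's issue rate (hypothetical value).
    PREFETCH_DISTANCE = 16

    @dataclass
    class Instr:
        op: str
        target: int | None = None  # nonsequential target address, if any

        def is_control_transfer(self) -> bool:
            return self.op in {"jmp", "call", "br"}

    @dataclass
    class Block:
        instrs: list[Instr] = field(default_factory=list)

    def insert_instruction_prefetches(block: Block) -> None:
        # For each control transfer with a nonsequential target, schedule
        # an instruction prefetch PREFETCH_DISTANCE instructions earlier,
        # so the target's cache line arrives before the transfer executes.
        points = [
            (max(0, i - PREFETCH_DISTANCE), ins.target)
            for i, ins in enumerate(block.instrs)
            if ins.is_control_transfer() and ins.target is not None
        ]
        # Insert from the back so earlier insertion indices stay valid.
        for pos, target in sorted(points, reverse=True):
            block.instrs.insert(pos, Instr("iprefetch", target))

Inserting back-to-front keeps the earlier insertion indices valid; a real pass would also hoist prefetches across basic-block boundaries along likely execution paths, as the paper's "far enough in advance" requirement implies.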