Providing adequate data bandwidth is extremely important for a future wide-issue processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add significantly to the hardware complexity. This paper takes an alternative, or complementary, approach to providing more data bandwidth, called data decoupling. In particular, this paper studies an interesting, yet less explored, behavior of memory access instructions, called access region locality, which concerns each static memory instruction and the range of locations it accesses at runtime. Our experimental study using a set of SPEC95 benchmark programs shows that most memory access instructions reference a single region at runtime. It also shows that the access region of a memory instruction can be predicted accurately at runtime by scrutinizing the instruction's addressing mode and its past access history. We describe and evaluate a wide-issue superscalar processor with two distinct sets of memory pipelines and caches, driven by the access region predictor. Experimental results indicate that the proposed mechanism is very effective in providing high memory bandwidth to the processor, resulting in comparable or better performance than a conventional memory design with a heavily multiported data cache, which entails much higher hardware complexity.
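
As a rough illustration of the prediction-and-steering idea, the sketch below shows a small PC-indexed access region predictor in C. The two-region split (stack vs. non-stack), the table size, the confidence counter, and all identifiers are illustrative assumptions rather than details taken from the paper; it is a minimal sketch of how a predicted region could steer each load or store to one of two memory pipelines and caches.

```c
/* Minimal sketch (assumed structure, not the paper's implementation) of an
 * access region predictor: a small table indexed by the PC of a static
 * load/store, recording which region it touched on its last execution. */
#include <stdint.h>

#define ARP_ENTRIES 1024                 /* assumed table size */

typedef enum { REGION_STACK, REGION_HEAP } region_t;

typedef struct {
    uint32_t tag;        /* partial PC tag to reduce aliasing */
    region_t last;       /* region observed on the last execution */
    uint8_t  confidence; /* simple saturating confidence counter */
} arp_entry_t;

static arp_entry_t arp[ARP_ENTRIES];

/* Predict the region a memory instruction at 'pc' will access, so the
 * instruction can be steered to the matching memory pipeline/cache. */
region_t arp_predict(uint32_t pc) {
    arp_entry_t *e = &arp[(pc >> 2) % ARP_ENTRIES];
    if (e->tag == (pc >> 2) && e->confidence > 0)
        return e->last;
    return REGION_HEAP;                  /* default steering on a miss */
}

/* Update the predictor once the effective address is known. */
void arp_update(uint32_t pc, uint32_t eff_addr, uint32_t stack_base) {
    region_t actual = (eff_addr >= stack_base) ? REGION_STACK : REGION_HEAP;
    arp_entry_t *e = &arp[(pc >> 2) % ARP_ENTRIES];
    if (e->tag != (pc >> 2)) {           /* new static instruction: reset */
        e->tag = pc >> 2;
        e->last = actual;
        e->confidence = 1;
    } else if (e->last == actual) {
        if (e->confidence < 3) e->confidence++;
    } else {
        e->last = actual;                /* region changed: retrain */
        e->confidence = 1;
    }
}
```

In this sketch, a load or store whose predicted region is REGION_STACK would be issued to one set of memory pipelines and its cache, and all other accesses to the second set; a misprediction would be detected when the effective address resolves and handled by whatever recovery policy the design adopts.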