Cache-coherent nonuniform memory access (CC-NUMA) multiprocessors provide a
scalable design for shared memory, but they continue to suffer from large
remote memory access latencies due to comparatively slow memory technology
and large data transfer latencies in the interconnection network. In this
paper, we propose a novel hardware caching technique, called switch cache,
to improve the remote memory access performance of CC-NUMA multiprocessors.
The main idea is to implement small, fast caches in the crossbar switches of
the interconnect to capture and store shared data as they flow from the
memory module to the requesting processor. These stored data act as a cache
for subsequent requests, greatly reducing the need for remote memory
accesses. The implementation of a cache in a crossbar switch needs to be
efficient and robust, yet flexible enough to accommodate changes in the
caching protocol. We present the design and implementation details of a
CAche Embedded Switch ARchitecture (CAESAR) that uses wormhole routing with
virtual channels. We
explore the design space of switch caches by modeling CAESAR in a detailed
execution-driven simulator and analyze the performance benefits. Our
results show that the CAESAR switch cache improves the performance of
CC-NUMA multiprocessors, with up to a 45 percent reduction in remote memory
accesses for some applications. By serving remote read requests at various
stages in the interconnect, we observe improvements in execution time as
high as 20 percent for these applications. We conclude that switch caches
provide a cost-effective solution for designing high-performance CC-NUMA
multiprocessors.