Commercial cache-coherent nonuniform memory access (ccNUMA) systems often
require extensive investments in hardware design and operating system
support. A different approach to building these systems is to use Standard
High Volume (SHV) hardware and stock software components as building blocks
and assemble them with minimal investments in hardware and software. This
design approach trades the performance advantages of specialized hardware
design for simplicity and implementation speed, and relies on
application-level tuning for scalability and performance. We present our
experience with this approach in this paper. We built a 16-way ccNUMA Intel
system consisting of
four commodity four-processor Fujitsu(R) Teamserver(TM) SMPs connected by a
Synfinity(TM) cache-coherent switch. The system features a total of sixteen
350-MHz Intel(R) Xeon(TM) processors and 4 GB of physical memory, and runs
the standard commercial Microsoft Windows NT(R) operating system.
The system can be partitioned statically or dynamically, and uses an
innovative, combined hardware/software approach to support
application-level performance tuning. On the hardware side, a programmable
performance-monitor card measures the frequency of remote-memory accesses,
which constitute the predominant source of performance overhead. The
monitor itself adds no performance overhead and can be deployed in
production mode, which makes dynamic performance tuning possible when the
application workload changes over time. On the software side, the Resource
Set abstraction allows application-level threads to improve performance and
scalability by specifying their execution and memory affinity across the
ccNUMA system. Results from
a performance-evaluation study confirm the success of the combined
hardware/software approach for performance tuning in computation-intensive
workloads. The results also show that the poor local-memory bandwidth in
commodity Intel-based systems, rather than the latency of remote-memory
access, is often the main contributor to poor scalability and performance.
The contributions of this work can be summarized as follows:
- The Resource Set abstraction allows control over resource allocation in a
  portable manner across ccNUMA architectures; we describe how it was
  implemented without modifying the operating system.
- A programmable performance-monitor card, designed specifically for a
  ccNUMA environment, allows dynamic, adaptive performance optimizations.
- A performance study shows that performance and scalability are often
  limited by the local-memory bandwidth rather than by the effects of
  remote-memory access in an Intel-based architecture.
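
To make the relationship between thread/memory affinity and remote-memory
accesses concrete, the following is a minimal sketch in Python. It is not
the paper's Resource Set API, which is not shown here: the ResourceSet
class, node_of, accesses_for, and remote_fraction are hypothetical names
introduced only for illustration. The only facts taken from the text are
the topology (four nodes of four processors each) and the idea that a
monitor counts accesses that cross node boundaries.

```python
from dataclasses import dataclass

CPUS_PER_NODE = 4  # four-processor SMP nodes, as in the system described
NUM_NODES = 4      # four nodes joined by the cache-coherent switch


def node_of(cpu: int) -> int:
    """Map a CPU index (0..15) to the node that hosts it."""
    if not 0 <= cpu < CPUS_PER_NODE * NUM_NODES:
        raise ValueError(f"no such CPU: {cpu}")
    return cpu // CPUS_PER_NODE


@dataclass(frozen=True)
class ResourceSet:
    """Hypothetical stand-in: CPUs a group of threads may run on, plus the
    node whose memory they allocate from."""
    cpus: frozenset
    memory_node: int


def accesses_for(rs: ResourceSet, per_cpu: int = 100) -> list:
    """Generate (cpu, memory_node) pairs, assuming every CPU in the set
    issues the same number of accesses to the set's memory node."""
    return [(cpu, rs.memory_node) for cpu in sorted(rs.cpus)
            for _ in range(per_cpu)]


def remote_fraction(accesses) -> float:
    """Share of accesses whose issuing CPU and target memory are on
    different nodes; this is the quantity a monitor would track."""
    accesses = list(accesses)
    if not accesses:
        return 0.0
    remote = sum(1 for cpu, mem in accesses if node_of(cpu) != mem)
    return remote / len(accesses)


# Threads and memory confined to node 0: no access crosses the switch.
local = ResourceSet(frozenset({0, 1, 2, 3}), memory_node=0)
# One thread per node, all memory on node 0: three of four CPUs are remote.
spread = ResourceSet(frozenset({0, 4, 8, 12}), memory_node=0)

print(remote_fraction(accesses_for(local)))   # 0.0
print(remote_fraction(accesses_for(spread)))  # 0.75
```

The sketch models only access locality. It says nothing about local-memory
bandwidth, which the study identifies as the more frequent limit on
scalability.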