Experience with building a commodity Intel-based ccNUMA system

Citation
B.C. Brock et al., Experience with building a commodity Intel-based ccNUMA system, IBM J RES, 45(2), 2001, pp. 207-227
Number of citations
28
Subject Categories
Multidisciplinary, Computer Science & Engineering
Journal title
IBM JOURNAL OF RESEARCH AND DEVELOPMENT
ISSN journal
0018-8646
Volume
45
Issue
2
Year of publication
2001
Pages
207 - 227
Database
ISI
SICI code
0018-8646(200103)45:2<207:EWBACI>2.0.ZU;2-T
Abstract
Commercial cache-coherent nonuniform memory access (ccNUMA) systems often require extensive investments in hardware design and operating system support. A different approach to building these systems is to use Standard High Volume (SHV) hardware and stock software components as building blocks and assemble them with minimal investments in hardware and software. This design approach trades the performance advantages of specialized hardware design for simplicity and implementation speed, and relies on application-level tuning for scalability and performance. We present our experience with this approach in this paper. We built a 16-way ccNUMA Intel system consisting of four commodity four-processor Fujitsu(R) Teamserver(TM) SMPs connected by a Synfinity(TM) cache-coherent switch. The system features a total of sixteen 350-MHz Intel(R) Xeon(TM) processors and 4 GB of physical memory, and runs the standard commercial Microsoft Windows NT(R) operating system. The system can be partitioned statically or dynamically, and uses an innovative, combined hardware/software approach to support application-level performance tuning. On the hardware side, a programmable performance-monitor card measures the frequency of remote-memory accesses, which constitute the predominant source of performance overhead. The monitor does not cause any performance overhead and can be deployed in production mode, providing the possibility for dynamic performance tuning if the application workload changes over time. On the software side, the Resource Set abstraction allows application-level threads to improve performance and scalability by specifying their execution and memory affinity across the ccNUMA system. Results from a performance-evaluation study confirm the success of the combined hardware/software approach for performance tuning in computation-intensive workloads.
The results also show that the poor local-memory bandwidth in commodity Intel-based systems, rather than the latency of remote-memory access, is often the main contributor to poor scalability and performance. The contributions of this work can be summarized as follows: (1) The Resource Set abstraction allows control over resource allocation in a portable manner across ccNUMA architectures; we describe how it was implemented without modifying the operating system. (2) An innovative programmable performance-monitor card, designed specifically for a ccNUMA environment, allows dynamic, adaptive performance optimizations. (3) A performance study shows that, in an Intel-based architecture, performance and scalability are often limited by the local-memory bandwidth rather than by the effects of remote-memory access.