The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent nonuniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote memory. However, the access latency to remote memory is 3 to 5 times the latency to local memory. CC-NOW machines provide the benefits of cache coherence to networks of workstations, at the cost of even higher remote access latency. Given the large remote access latencies of these architectures, data locality is potentially the most important performance issue. Using realistic workloads, we study the performance improvements provided by OS-supported dynamic page migration and replication. Analyzing our kernel-based implementation, we provide a detailed breakdown of the costs. We show that sampling of cache misses can be used to reduce cost without compromising performance, and that TLB misses may not be a consistent approximation for cache misses. Finally, our experiments show that dynamic page migration and replication can substantially increase application performance, by as much as 30%, and reduce contention for resources in the NUMA memory system.
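To make the mechanism concrete, the following is a minimal user-level sketch, not the paper's kernel implementation, of the kind of policy the abstract describes: per-page cache misses are sampled to keep bookkeeping cheap, and pages whose sampled miss counts cross a threshold are either migrated (when writes from one node dominate) or replicated (when reads are shared across nodes). All names, thresholds, and data structures here are illustrative assumptions.

```c
/*
 * Illustrative sketch only: a user-level model of a sampled
 * migrate/replicate policy for NUMA pages.  Names, thresholds,
 * and counters are hypothetical, not the paper's kernel code.
 */
#include <stdio.h>

#define NODES        4      /* nodes in the NUMA machine (assumed)        */
#define PAGES        8      /* pages tracked in this toy model (assumed)  */
#define SAMPLE_RATE  16     /* count only every 16th cache miss (assumed) */
#define HOT_THRESH   32     /* sampled misses before we act (assumed)     */

struct page_stats {
    int home;               /* node currently holding the page            */
    int misses[NODES];      /* sampled miss count per requesting node     */
    int writes;             /* sampled write misses (any node)            */
    int replicated;         /* nonzero once read-only copies exist        */
};

static struct page_stats pages[PAGES];

/* Record a cache miss; sampling keeps the bookkeeping cost low. */
static void record_miss(int page, int node, int is_write, int seq)
{
    if (seq % SAMPLE_RATE != 0)
        return;                       /* skip unsampled misses            */
    pages[page].misses[node]++;
    if (is_write)
        pages[page].writes++;
}

/* Periodic policy pass: migrate write-heavy pages to the dominant
 * node, replicate read-shared pages, otherwise leave them alone.  */
static void policy_pass(void)
{
    for (int p = 0; p < PAGES; p++) {
        struct page_stats *ps = &pages[p];
        int total = 0, top = 0;

        for (int n = 0; n < NODES; n++) {
            total += ps->misses[n];
            if (ps->misses[n] > ps->misses[top])
                top = n;
        }
        if (total < HOT_THRESH)
            continue;                 /* not hot enough to touch          */

        if (ps->writes > total / 2) {
            printf("page %d: migrate from node %d to node %d\n",
                   p, ps->home, top);
            ps->home = top;           /* writers dominate: migrate        */
        } else {
            printf("page %d: replicate read-only copies\n", p);
            ps->replicated = 1;       /* readers dominate: replicate      */
        }
        for (int n = 0; n < NODES; n++)   /* reset sampled counters       */
            ps->misses[n] = 0;
        ps->writes = 0;
    }
}

int main(void)
{
    /* Simulate misses: page 0 is write-heavy from node 2,
     * page 1 is read-shared across nodes.                   */
    for (int i = 0; i < 4096; i++) {
        record_miss(0, 2, 1, i);
        record_miss(1, (i / SAMPLE_RATE) % NODES, 0, i);
    }
    policy_pass();
    return 0;
}
```

The sampling rate in such a scheme trades bookkeeping overhead against how quickly hot pages are detected, which is the cost/benefit question the abstract refers to.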