Parallel workstations, each comprising tens of processors based on shared memory, promise cost-effective scalable multiprocessing. This article explores the coupling of such small- to medium-scale shared-memory multiprocessors through software over a local area network to synthesize larger shared-memory systems. We call these systems Distributed Shared-memory MultiProcessors (DSMPs). This article introduces the design of a shared-memory system that uses multiple granularities of sharing, called MGS, and presents a prototype implementation of MGS on the MIT Alewife multiprocessor. Multigrain shared memory enables the collaboration of hardware and software shared memory, thus synthesizing a single transparent shared-memory address space across a cluster of multiprocessors. The system leverages the efficient support for fine-grain cache-line sharing within multiprocessor nodes as often as possible, and resorts to coarse-grain page-level sharing across nodes only when absolutely necessary. Using our prototype implementation of MGS, an in-depth study of several shared-memory applications is conducted to understand the behavior of DSMPs. Our study is the first to comprehensively explore the DSMP design space, and to compare the performance of DSMPs against all-software and all-hardware DSMs on a single experimental platform. Keeping the total number of processors fixed, we show that applications execute up to 85% faster on a DSMP as compared to an all-software DSM. We also show that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications, between 159% and 1014%. However, program transformations to improve data locality for these applications allow DSMPs to almost match the performance of an all-hardware multiprocessor of the same size.
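To make the two sharing grains concrete, the C sketch below caricatures how a single shared read might be resolved on a DSMP: an access whose home lies within the local multiprocessor node is served by hardware cache-line coherence, while an access to a remote node falls back to page-level software shared memory. This is an illustrative sketch only, not the MGS implementation; every identifier in it (node_of, hw_load, sw_dsm_map_page, LOCAL_NODE, PAGE_SIZE) is an assumption made up for this example.

```c
/* Illustrative sketch of the multigrain idea, not the actual MGS code.
 * All helpers here are stand-ins invented for this example. */
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE   4096u
#define LOCAL_NODE  0   /* id of this multiprocessor node (assumed) */

/* Assumed: which multiprocessor node is home for this address
 * (here, simply taken from the high address bits). */
static int node_of(uintptr_t addr) { return (int)(addr >> 28) & 0x7; }

/* Assumed: an ordinary load; within a node the hardware keeps such
 * accesses coherent at cache-line granularity. */
static long hw_load(uintptr_t addr) { return *(volatile long *)addr; }

/* Assumed: software DSM handler that replicates a remote page locally.
 * A real handler would fetch the page over the LAN and update a
 * per-page coherence directory; here it just returns a blank copy. */
static void *sw_dsm_map_page(uintptr_t page_addr)
{
    (void)page_addr;
    return calloc(1, PAGE_SIZE);   /* placeholder local page copy */
}

/* One shared-memory read, resolved at the finest grain available. */
static long mgs_read(uintptr_t addr)
{
    if (node_of(addr) == LOCAL_NODE) {
        /* Intra-node: fine-grain, hardware cache-line sharing
         * (the common, fast case). */
        return hw_load(addr);
    }
    /* Inter-node: coarse-grain, page-level software sharing.
     * Once the page is replicated locally, later accesses to it are
     * again handled by the node's hardware coherence. */
    uintptr_t page = addr & ~(uintptr_t)(PAGE_SIZE - 1);
    long *copy = (long *)sw_dsm_map_page(page);
    return copy[(addr - page) / sizeof(long)];
}
```

The point of the sketch is the asymmetry: the intra-node path is a plain memory access, while the inter-node path pays a page-grain software cost, which is why MGS tries to keep sharing within nodes whenever possible.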