Clusters of symmetric multiprocessors (SMPs) are important platforms for
high-performance computing. With the success of hardware cache-coherent
distributed shared memory (DSM), considerable effort has also gone into
supporting the coherent shared-address-space programming model in software
on clusters. Much research has been done on fast communication on clusters
and on protocols for supporting software shared memory across them.
However, the performance of shared virtual memory (SVM) is still far from
that achieved on hardware DSM systems. The goal of this paper is to improve
the performance of SVM on system area network clusters by considering the
interactions between the communication and protocol layers. We first
examine which communication system bottlenecks stand in the way of
improving the parallel performance of SVM clusters: in particular, which
parameters of the communication architecture are most important to improve
further relative to processor speed, which are already adequate on modern
systems for most applications, and how this will change with technology in
the future. We find that the most important communication subsystem cost to
improve is the overhead of generating and delivering interrupts for
asynchronous protocol processing. We then show that, by providing simple
and general support for asynchronous message handling in a commodity
network interface (NI) and by altering SVM protocols appropriately,
protocol activity can be decoupled from asynchronous message handling, and
the need for interrupts or polling can be eliminated. The NI mechanisms
needed are generic, not SVM-dependent. We prototype the mechanisms and such
a synchronous home-based lazy release consistency (LRC) protocol, called
GeNIMA (GEneral-purpose Network Interface support for shared Memory
Abstractions), on a cluster of SMPs with a programmable NI. We find that
the performance improvements are substantial, bringing performance on a
small-scale SMP cluster much closer to that of hardware-coherent shared
memory for many applications, and we show the value of each of the
mechanisms in different applications.