The emergence of low-latency, high-throughput routers means that network locality issues no longer dominate the performance of parallel algorithms. One of the key performance issues is now the even distribution of work across the machine as the problem size and number of processors increase. This paper describes the implementation of a highly scalable shared queue, supporting the concurrent insertion and deletion of elements. The main characteristics of the queue are that there is no fixed limit on the number of outstanding requests and that performance scales linearly with the number of processors (subject to increasing network latencies). The queue is implemented using a general-purpose computational model, called the WPRAM. The model includes a shared address space which uses weak coherency semantics. The implementation makes extensive use of pairwise synchronization and concurrent atomic operations to achieve scalable performance. The WPRAM is targeted at the class of distributed-memory machines which use a scalable interconnection network.