Commodity microprocessors contain more on-chip memory with each successive
generation, and will contain tens of megabytes within the decade. We describe
a novel architecture that runs an unmodified uniprocessor program across
multiple nodes, each of which contains a processor tightly integrated with
a sizable memory. The execution of instructions is replicated, while the
access of operands is distributed across the nodes. Each node accesses
operands in its fast local memory and broadcasts them to the other nodes. This
architecture exploits out-of-order execution, and the fact that each chip
integrates processor and memory, to run memory-intensive, hard-to-parallelize
programs more efficiently. In this paper, we describe an implementation
with specific solutions to the unique problems that this architecture poses.
Finally, we conclude by comparing simulation results of our implementation
to more traditional equivalent systems. In our simulated implementation,
five unmodified SPEC95 binaries ran, in most cases, considerably faster
than in systems with more traditional memory systems. (C) 1999 Elsevier
Science B.V. All rights reserved.
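
The idea of replicated execution with distributed operand access can be
illustrated with a small toy model. This is only a sketch under stated
assumptions, not the implementation the paper describes: the three-operation
instruction format, the Node class, the run function, and the interleaving of
addresses across nodes (address modulo node count) are all hypothetical
choices made for illustration. Out-of-order execution and timing are not
modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One processor-plus-memory node (hypothetical toy model)."""
    node_id: int
    local_memory: dict = field(default_factory=dict)
    registers: dict = field(default_factory=dict)
    local_reads: int = 0  # loads served from this node's own memory


def run(program, memory, num_nodes):
    """Run one uniprocessor instruction stream on every node.

    Assumption: address `a` lives on node `a % num_nodes`.
    """
    nodes = [Node(i) for i in range(num_nodes)]
    for addr, value in memory.items():
        nodes[addr % num_nodes].local_memory[addr] = value

    for op in program:
        kind = op[0]
        if kind == "load":
            _, reg, addr = op
            # Only the owning node touches memory ...
            owner = nodes[addr % num_nodes]
            value = owner.local_memory[addr]
            owner.local_reads += 1
            # ... and the operand is broadcast to every node.
            for node in nodes:
                node.registers[reg] = value
        elif kind == "add":
            _, dst, a, b = op
            # Execution is replicated: every node computes the same result.
            for node in nodes:
                node.registers[dst] = node.registers[a] + node.registers[b]
        elif kind == "store":
            _, reg, addr = op
            # Every node holds the value, so only the owner needs to write it.
            owner = nodes[addr % num_nodes]
            owner.local_memory[addr] = owner.registers[reg]
        else:
            raise ValueError(f"unknown operation: {kind}")
    return nodes


if __name__ == "__main__":
    program = [
        ("load", "r1", 0),
        ("load", "r2", 1),
        ("add", "r3", "r1", "r2"),
        ("store", "r3", 2),
    ]
    for node in run(program, {0: 5, 1: 7, 2: 0}, num_nodes=2):
        print(node.node_id, node.registers, node.local_memory)
```

In this model every node ends with identical register contents, while each
load is served by exactly one node's local memory. That is the property the
abstract relies on: no node waits on a remote memory for an operand that
another node can fetch locally and broadcast.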