We introduce a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near-optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity. Ongoing experimental work in testing and evaluating these algorithms has thus far shown very promising results.