The supercomputer market is now dominated by parallel architectures, a
mong which massively parallel computers (MPCs) are an important class
of systems. The memory of an MPC is physically distributed among an en
semble of computing nodes that communicate by sending data through a n
etwork. Communication operations can be either point-to-point, with on
e source and one destination, or collective, with more than two partic
ipating processes. The design of collective communication operations d
epends on the MPC's underlying network architecture. While there has b
een little consensus on some aspects of communication architectures, s
uch as network topology, a good deal of agreement exists regarding the
most efficient way to switch messages through the network. Most MPCs
use wormhole routing, in which each message is divided into small piec
es that are pipelined through the network. Compared with the store-and
-forward switching method used in early multicomputers, wormhole routi
ng reduces the effect of path length on communication time. However, i
n situations where multiple messages exist in the network concurrently
, wormhole routing can exacerbate channel contention, which occurs whe
n blocked messages hold some communication channels while waiting for
others. Invoking a collective operation, which can involve many messag
es, poses this situation. In recent years, many projects have addresse
d the design of efficient collective communication algorithms for worm
hole-routed systems. By exploiting the relative distance-insensitivity
of wormhole routing, these new algorithms often differ fundamentally
from their store-and-forward counterparts. This article examines softw
are and hardware approaches to implementing collective communication o
perations, illustrating several issues arising in this research area a
nd describing the major classes of algorithms proposed to solve these
problems.