In this paper, we evaluate the performance impact of various implementation techniques for collective I/O operations, and we do so across four important parallel architectures. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized implementation does provide excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance on all four computational platforms. Next, we evaluate implementation techniques for thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than executing the collective I/O routine entirely in the foreground. To improve performance, we explore an alternative approach in which part of the collective I/O operation is performed in the background and part is performed in the foreground. We demonstrate that this implementation technique can provide significant performance gains, offering up to a 50% improvement over implementations that do not attempt to overlap collective I/O and computation. (C) 2001 Academic Press.
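The split-background/foreground overlap described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names (`write_chunk`, `overlapped_write`), the 50/50 split point, and the use of an in-memory list as the I/O sink are all assumptions made for the sake of a minimal, self-contained example.

```python
import threading

def write_chunk(buffer, out, lo, hi):
    # Stand-in for the I/O portion of a collective write
    # (a real implementation would issue file-system calls here).
    out.extend(buffer[lo:hi])

def overlapped_write(buffer, out, compute, split=0.5):
    """Write part of `buffer` in a background thread while running
    `compute` in the foreground, then finish the write in the foreground."""
    mid = int(len(buffer) * split)
    # Background: start writing the first portion of the buffer.
    t = threading.Thread(target=write_chunk, args=(buffer, out, 0, mid))
    t.start()
    result = compute()  # Foreground computation overlaps the background I/O.
    t.join()
    # Foreground: finish the remaining portion of the write.
    write_chunk(buffer, out, mid, len(buffer))
    return result

data = list(range(8))
sink = []
total = overlapped_write(data, sink, lambda: sum(data))
```

The key design point matching the abstract is that only *part* of the operation runs in the background; the caller finishes the rest itself rather than delegating the whole collective I/O routine to the spawned thread.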