Fast Fourier transforms parallelize well but need large amounts of
communication. The transpose split algorithm concentrates all the
communication in one or two transposition steps. Different
transposition algorithms can be used depending on data size and
communication latency. A new transpose split algorithm for real and
Hermitian data is presented for one-, two-, and three-dimensional
transforms.
This algorithm is implemented on the Fujitsu VPP 500, a parallel
processor with a moderate number of very fast vector processors
connected by a crossbar switch. Each processor has a peak performance
of 1.6 Gflop/s and can simultaneously read and write 400 MByte/s.
Stride-one implementations of multiple FFTs with very long vector
lengths on one node, as described by the author in 1994, are combined
with optimized transpositions. One third of peak performance was
achieved on configurations with up to 32 processors.
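
The general transpose split idea for a two-dimensional complex
transform can be illustrated with a minimal sketch. It is not the
VPP 500 implementation and not the real/Hermitian variant presented
here. It assumes a square N x N array with N a power of two, row slabs
distributed over a number of processors that divides N, and it
simulates the processors as plain Python lists. All function names
(fft, split_rows, global_transpose, fft2_transpose_split, dft2_direct)
are hypothetical and chosen only for this illustration.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def split_rows(a, nproc):
    """Distribute rows in contiguous slabs, one slab per simulated processor."""
    per = len(a) // nproc
    return [a[p * per:(p + 1) * per] for p in range(nproc)]


def global_transpose(slabs):
    """Simulated all-to-all: the only step in which data crosses processors."""
    nproc = len(slabs)
    per = len(slabs[0])
    out = []
    for q in range(nproc):                       # receiving processor
        rows = []
        for j in range(q * per, (q + 1) * per):  # global column j becomes a local row
            rows.append([slabs[p][r][j]
                         for p in range(nproc)   # sending processor
                         for r in range(per)])   # its local rows, in global row order
        out.append(rows)
    return out


def fft2_transpose_split(a, nproc):
    """2D FFT as: local row FFTs, transpose, local row FFTs, transpose back."""
    slabs = split_rows(a, nproc)
    slabs = [[fft(row) for row in slab] for slab in slabs]  # local work only
    slabs = global_transpose(slabs)                         # communication step 1
    slabs = [[fft(row) for row in slab] for slab in slabs]  # local work only
    slabs = global_transpose(slabs)                         # step 2, restores layout
    return [row for slab in slabs for row in slab]


def dft2_direct(a):
    """O(N^4) reference 2D DFT, used only to check the sketch."""
    n = len(a)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    return [[sum(a[i][j] * w[(i * k1 + j * k2) % n]
                 for i in range(n) for j in range(n))
             for k2 in range(n)]
            for k1 in range(n)]


if __name__ == "__main__":
    n, nproc = 8, 4
    a = [[complex((3 * i + 5 * j) % 7, (i + 2 * j) % 3) for j in range(n)]
         for i in range(n)]
    got = fft2_transpose_split(a, nproc)
    ref = dft2_direct(a)
    err = max(abs(got[i][j] - ref[i][j]) for i in range(n) for j in range(n))
    print("max abs error vs direct DFT:", err)
```

In this sketch every one-dimensional FFT runs on data local to one
processor with unit stride, and all communication sits in the two
calls to global_transpose. The second transpose can be dropped if a
transposed result is acceptable to the caller, which leaves a single
communication step.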