D. Takahashi et Y. Kanada, High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers, J SUPERCOMP, 15(2), 2000, pp. 207-228
In this paper, we propose high-performance radix-2, 3 and 5 parallel 1-D co
mplex FFT algorithms for distributed-memory parallel computers. We use the
four-step or six-step FFT algorithms to implement the radix-2, 3 and 5 para
llel 1-D complex FFT algorithms. In our parallel FFT algorithms, since we u
se cyclic distribution, all-to-all communication takes place only once. Mor
eover, the input data and output data are both in natural order.
We also show that the suitability of a parallel FFT algorithm is machine-de
pendent because of the differences in the architecture of the processor ele
ments in distributed-memory parallel computers. Experimental results of 2(p
)3(q)5(r) point FFTs on distributed-memory parallel computers, HITACHI SR22
01 and IBM SP2 are reported. We succeeded to get performances of about 130
GFLOPS on a 1024PE HITACHI SR2201 and about 1.25 GFLOPS on a 32PE IBM SP2.