Modern computer architecture provides a special instruction - the fused mul
tiply-add (FMA) instruction - to perform both a multiplication and an addit
ion operation at the same time. In this paper newly developed radix-2, radi
x-3, and radix-5 FFT kernels that efficiently take advantage of this powerf
ul instruction are presented. If a processor is provided with FMA instructi
ons, the radix-a FFT algorithm introduced has the lowest complexity of all
Cooley-Tukey radix-2 algorithms. All floating-point operations are executed
as FMA instructions. Compared to conventional radix-3 and radix-5 kernels,
the new radix-3 and radix-5 kernels greatly improve the utilization of FMA
instructions, which results in a significant reduction in complexity. In g
eneral, the advantages of the FFT algorithms presented in this paper are th
eir low arithmetic complexity, their high efficiency, and their striking si
mplicity Numerical experiments show that FFT programs using the new kernels
clearly outperform even the best conventional FFT routines.