Z. Luo and M. Martonosi, Accelerating pipelined integer and floating-point accumulations in configurable hardware with delayed addition techniques, IEEE COMPUT, 49(3), 2000, pp. 208-218
The speed of arithmetic calculations in configurable hardware is limited by
carry propagation, even with the dedicated hardware found in recent FPGAs.
This paper proposes and evaluates an approach called delayed addition that
reduces the carry-propagation bottleneck and improves the performance of
arithmetic calculations. Our approach employs the idea used in Wallace trees
to store the results in an intermediate form and delay addition until the
end of a repeated calculation such as accumulation or dot-product; this
effectively removes carry propagation overhead from the calculation's critical
path. We present both integer and floating-point designs that use our
technique. Our pipelined integer multiply-accumulate (MAC) design is based on a
fairly traditional multiplier design, but with delayed addition as well.
This design achieves a 72 MHz clock rate on an XC4036xla-9 FPGA and a 170 MHz
clock rate on an XV300epq240-8 FPGA. Next, we present a 32-bit floating-point
accumulator based on delayed addition. Here, delayed addition requires a
novel alignment technique that decouples the incoming operands from the
accumulated result. A conservative version of this design achieves a 40 MHz
clock rate on an XC4036xla-9 FPGA and a 97 MHz clock rate on an XV100epq240-8
FPGA. We also present a 32-bit floating-point accumulator design with
compiler-managed overflow avoidance that achieves an 80 MHz clock rate on an
XC4036xla-9 FPGA and a 150 MHz clock rate on an XCV100epq240-8 FPGA.
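
The delayed-addition idea described in the abstract can be illustrated in software with a carry-save accumulator. The sketch below is only a behavioral model under stated assumptions, not the authors' hardware design: the function names (carry_save_add, delayed_accumulate) and the 32-bit default width are hypothetical choices made for illustration. The running total is kept as a separate sum word and carry word, so each step uses only bitwise logic with no carry chain, and the single carry-propagate addition is deferred to the end.

```python
def carry_save_add(s, c, x, width=32):
    """One 3:2 compression step (hypothetical helper).

    Returns (s2, c2) such that s2 + c2 == s + c + x (mod 2**width),
    using only bitwise operations, i.e. no carry propagation.
    """
    mask = (1 << width) - 1
    s2 = (s ^ c ^ x) & mask                             # bitwise sum
    c2 = (((s & c) | (s & x) | (c & x)) << 1) & mask    # saved carries
    return s2, c2


def delayed_accumulate(values, width=32):
    """Accumulate values in carry-save form, then resolve once."""
    mask = (1 << width) - 1
    s, c = 0, 0
    for v in values:
        # v & mask maps negative integers to two's complement.
        s, c = carry_save_add(s, c, v & mask, width)
    # The only carry-propagate addition happens here, after the loop.
    return (s + c) & mask


if __name__ == "__main__":
    data = [3, 5, 7, 11]
    print(delayed_accumulate(data) == sum(data) % (1 << 32))
```

In hardware terms, the loop body corresponds to a row of independent full adders whose delay does not grow with operand width, which is why deferring the final addition shortens the critical path of the accumulation.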