P. Soderquist and M. Leeser, "Area and Performance Tradeoffs in Floating-Point Divide and Square-Root Implementations," ACM Computing Surveys, 28(3), 1996, pp. 518-564
Number of citations
64
Subject categories
Computer Sciences; Computer Science Theory & Methods
Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units, together with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor.
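For context on why Givens rotation is a telling benchmark, the minimal C sketch below shows that annihilating one element of a 2-vector costs a square root and two divisions, so the benchmark's speed tracks divide/square-root performance rather than the add/multiply pipelines. The function name and the unguarded arithmetic are illustrative assumptions; production codes scale the operands to avoid overflow in a*a + b*b.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative Givens rotation: compute c, s so that the rotation
   [ c  s; -s  c ] applied to (a, b) zeroes the second component.
   One sqrt and two divides per call -- exactly the operations whose
   latency and throughput this article studies. */
void givens(double a, double b, double *c, double *s)
{
    double r = sqrt(a * a + b * b);   /* square root */
    if (r == 0.0) { *c = 1.0; *s = 0.0; return; }
    *c = a / r;                       /* divide */
    *s = b / r;                       /* divide */
}

int main(void)
{
    double c, s;
    givens(3.0, 4.0, &c, &s);
    printf("c = %g, s = %g\n", c, s); /* c = 0.6, s = 0.8 */
    return 0;
}
```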
Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square-root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints and modest performance on divide and square-root operations is tolerable.
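As a concrete illustration of the multiplicative approach, the C sketch below computes a quotient by Newton-Raphson reciprocal iteration. The classic seed 48/17 - (32/17)d for d in [0.5, 1) and the fixed iteration count are textbook choices, not details from this article; every step is a multiply or multiply-add, which is why a hardware version occupies the multiplier pipeline, and extra work (not shown) is needed to guarantee IEEE-correct rounding.

```c
#include <stdio.h>

/* Newton-Raphson division sketch: q = a * (1/d).
   Assumes d is already scaled into [0.5, 1); a real FPU iterates on
   the significand and handles sign/exponent separately. Each
   iteration doubles the number of correct bits (quadratic
   convergence), so four iterations from a ~4-bit seed exceed double
   precision. */
double nr_divide(double a, double d)
{
    double x = 48.0 / 17.0 - 32.0 / 17.0 * d;  /* linear seed, ~1/17 max error */
    for (int i = 0; i < 4; i++)
        x = x * (2.0 - d * x);                 /* refine reciprocal of d */
    return a * x;                              /* final back-multiply */
}

int main(void)
{
    printf("%.15g\n", nr_divide(1.0, 0.7));    /* ~1.42857142857143 */
    return 0;
}
```

Goldschmidt's algorithm performs essentially the same multiplications but arranges the two per iteration to be independent, which shortens latency in a pipelined multiplier at the cost of not being self-correcting.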
Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations. Combined with their reasonable area requirements, this gives these implementations a favorable balance of performance and area across different floating-point unit configurations. Recent developments in microprocessor technology, such as decoupled superscalar implementations and increasing instruction issue rates, also favor the parallel, independent operation afforded by subtractive methods.
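To make the contrast concrete, here is a radix-2 SRT digit-recurrence sketch in C (radix 2 rather than the radix-4/-16 designs the article analyzes, so the digit-selection rule stays a one-line comparison). The redundant digit set {-1, 0, 1} is what lets hardware pick each quotient digit from a few high-order bits of the partial remainder instead of a full-width compare; radix-4 retires two quotient bits per cycle using digits {-2, ..., 2} and a small selection table.

```c
#include <stdio.h>

/* Radix-2 SRT division sketch: computes x/d for 0.5 <= d < 1 and
   |x| < d, one quotient digit per iteration. The recurrence is
   p <- 2p - q*d with q in {-1, 0, 1}; the slack in the redundant
   digit set maintains the invariant |p| <= d even though q is chosen
   from a coarse comparison against +/-1/2. */
double srt_divide(double x, double d, int bits)
{
    double p = x;     /* partial remainder, invariant |p| <= d */
    double q = 0.0;   /* quotient, accumulated digit by digit */
    double w = 0.5;   /* weight 2^-(j+1) of the next digit */
    for (int j = 0; j < bits; j++) {
        p *= 2.0;                                    /* shift left */
        int digit = (p >= 0.5) ? 1 : (p < -0.5 ? -1 : 0);
        p -= digit * d;                              /* subtract q*d */
        q += digit * w;
        w *= 0.5;
    }
    return q;   /* correct to roughly 'bits' fractional bits */
}

int main(void)
{
    printf("%.15g\n", srt_divide(0.3, 0.75, 55));    /* ~0.4 */
    return 0;
}
```

Because such a unit is built around its own shifter and adder rather than the shared multiplier, divides and square roots can proceed alongside other floating-point work, which is the parallelism advantage the abstract describes.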