AREA AND PERFORMANCE TRADEOFFS IN FLOATING-POINT DIVIDE AND SQUARE-ROOT IMPLEMENTATIONS

Citation
P. Soderquist and M. Leeser, AREA AND PERFORMANCE TRADEOFFS IN FLOATING-POINT DIVIDE AND SQUARE-ROOT IMPLEMENTATIONS, ACM Computing Surveys, 28(3), 1996, pp. 518-564
Citations number
64
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Journal title
ACM Computing Surveys
ISSN journal
03600300
Volume
28
Issue
3
Year of publication
1996
Pages
518 - 564
Database
ISI
SICI code
0360-0300(1996)28:3<518:AAPTIF>2.0.ZU;2-3
Abstract
Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units, with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square-root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints and modest performance on divide and square-root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations. Combined with their reasonable area requirements, this gives these implementations a favorable balance of performance and area across different floating-point unit configurations. Recent developments in microprocessor technology, such as decoupled superscalar implementations and increasing instruction issue rates, also favor the parallel, independent operation afforded by subtractive methods.
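To illustrate the multiplicative method named in the abstract, the sketch below shows Newton-Raphson reciprocal refinement in plain Python. It is not the paper's hardware design, only a minimal software analogue: the seed constant 48/17 - (32/17)m is one classic linear initial estimate (real floating-point units would use a lookup table instead), and the function names are invented for this example. Each iteration uses two multiplies and a subtract, and the number of correct bits roughly doubles per step, which is why these implementations can reach low latency but tie up the multiplier pipeline.

```python
import math

def nr_reciprocal(d, iterations=4):
    """Approximate 1/d by the Newton-Raphson iteration x <- x * (2 - d*x).

    Convergence is quadratic: the relative error is squared each step,
    so the count of correct bits roughly doubles per iteration.
    """
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    # Scale |d| into [0.5, 1): d = m * 2**e, as a hardware unit would
    # normalize the significand.
    m, e = math.frexp(abs(d))
    # Classic linear seed minimizing worst-case error on [0.5, 1)
    # (an assumption for this sketch; real FPUs use a small ROM table).
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        x = x * (2.0 - m * x)      # two multiplies + one subtract per step
    r = math.ldexp(x, -e)          # undo the exponent scaling
    return -r if d < 0 else r

def nr_divide(a, d, iterations=4):
    """Compute a/d as a * (1/d); division becomes multiplications."""
    return a * nr_reciprocal(d, iterations)
```

With the seed above, four iterations drive the relative error far below double-precision rounding error, e.g. `nr_divide(355.0, 113.0)` matches `355.0 / 113.0` to within a few ulps. Goldschmidt's algorithm rearranges the same multiplications so the two per-step multiplies are independent and can be pipelined, at the cost of not being self-correcting.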