AREA AND PERFORMANCE TRADEOFFS IN FLOATING-POINT DIVIDE AND SQUARE-ROOT IMPLEMENTATIONS

Citation
P. Soderquist and M. Leeser, AREA AND PERFORMANCE TRADEOFFS IN FLOATING-POINT DIVIDE AND SQUARE-ROOT IMPLEMENTATIONS, ACM Computing Surveys, 28(3), 1996, pp. 518-564
Citations number
64
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Journal title
ACM Computing Surveys
ISSN journal
03600300
Volume
28
Issue
3
Year of publication
1996
Pages
518 - 564
Database
ISI
SICI code
0360-0300(1996)28:3<518:AAPTIF>2.0.ZU;2-3
Abstract
Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units, with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square-root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints and modest performance on divide and square-root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations. Combined with their reasonable area requirements, this gives these implementations a favorable balance of performance and area across different floating-point unit configurations. Recent developments in microprocessor technology, such as decoupled superscalar implementations and increasing instruction issue rates, also favor the parallel, independent operation afforded by subtractive methods.
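To illustrate the multiplicative method named in the abstract, the sketch below shows Newton-Raphson reciprocal refinement in plain Python. It is not the paper's hardware design, only a minimal software analogue: the seed constant 48/17 - (32/17)m is one classic linear initial estimate (real floating-point units would use a lookup table instead), and the function names are invented for this example. Each iteration uses two multiplies and a subtract, and the number of correct bits roughly doubles per step, which is why these implementations can reach low latency but tie up the multiplier pipeline.

```python
import math

def nr_reciprocal(d, iterations=4):
    """Approximate 1/d by the Newton-Raphson iteration x <- x * (2 - d*x).

    Convergence is quadratic: the relative error is squared each step,
    so the count of correct bits roughly doubles per iteration.
    """
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    # Scale |d| into [0.5, 1): d = m * 2**e, as a hardware unit would
    # normalize the significand.
    m, e = math.frexp(abs(d))
    # Classic linear seed minimizing worst-case error on [0.5, 1)
    # (an assumption for this sketch; real FPUs use a small ROM table).
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        x = x * (2.0 - m * x)      # two multiplies + one subtract per step
    r = math.ldexp(x, -e)          # undo the exponent scaling
    return -r if d < 0 else r

def nr_divide(a, d, iterations=4):
    """Compute a/d as a * (1/d); division becomes multiplications."""
    return a * nr_reciprocal(d, iterations)
```

With the seed above, four iterations drive the relative error far below double-precision rounding error, e.g. `nr_divide(355.0, 113.0)` matches `355.0 / 113.0` to within a few ulps. Goldschmidt's algorithm rearranges the same multiplications so the two per-step multiplies are independent and can be pipelined, at the cost of not being self-correcting.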