APPROACHING A MACHINE-APPLICATION BOUND IN DELIVERED PERFORMANCE ON SCIENTIFIC CODE

Authors

MANGIONESMITH WH; SHIH TP; ABRAHAM SG; DAVIDSON ES

Citation

Wh. Mangionesmith et al., APPROACHING A MACHINE-APPLICATION BOUND IN DELIVERED PERFORMANCE ON SCIENTIFIC CODE, Proceedings of the IEEE, 81(8), 1993, pp. 1166-1178

Subject categories

Engineering, Electrical & Electronic

Journal title

Proceedings of the IEEE

ISSN journal

0018-9219

Volume

81

Issue

8

Year of publication

1993

Pages

1166 - 1178

Database

ISI

SICI code

0018-9219(1993)81:8<1166:AAMBID>2.0.ZU;2-B

Abstract

We have developed a performance bounding methodology that explains the performance of loop-dominated scientific applications on particular systems. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory interface, floating-point, instruction issue, and a "dependence unit" which is used to model the effects of performance-limiting recurrences. We propose a workload characterization, and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable. A detailed analysis and performance improvement effort for the IBM RS/6000, using the Livermore Fortran Kernels 1-12 to represent the target workload, produces an average lower bound of 1.27 clocks per floating-point operation (CPF), whereas machine peak performance is 0.5 CPF and the V2.01 Fortran compiler attains only 2.43 CPF. Code improvements in this study have achieved 1.36 CPF, increasing the harmonic mean steady-state inner loop performance to 97.6% of the MFLOPS bound. Subsequently, the V2.02 compiler achieved 1.75 CPF, and 1.60 with carefully chosen preprocessing. A goal-directed compiler with bound knowledge could produce higher-performance code more efficiently and automatically. In general, achieved performance is also affected by cache misses and register spill code. Simple calibration loops are used to characterize cache performance. The register requirements are characterized as a function of the latency and bandwidth of memory and function units for application kernels that have tree-structured dependence graphs.
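
The abstract's idea of bounding loop performance by the busiest of four hardware units can be illustrated with a minimal sketch. It is not the authors' model: the class names, the per-iteration counts, and the throughput figures below are all hypothetical, chosen only so that one fused multiply-add per clock corresponds to the 0.5 CPF peak quoted above. The sketch assumes that each unit's work per loop iteration is converted to clocks, that the slowest unit sets the bound, and that dividing by the floating-point operations per iteration gives CPF.

```python
from dataclasses import dataclass


@dataclass
class LoopBody:
    """Per-iteration workload counts (hypothetical characterization)."""
    memory_ops: int            # loads + stores per iteration
    fp_instructions: int       # floating-point instructions per iteration
    issued_instructions: int   # all instructions issued per iteration
    recurrence_clocks: float   # clocks forced by a loop-carried recurrence
    flops: int                 # floating-point operations per iteration


@dataclass
class Machine:
    """Peak unit throughputs per clock (illustrative, not measured values)."""
    memory_ops_per_clock: float
    fp_instructions_per_clock: float
    issue_per_clock: float


def unit_clocks(loop: LoopBody, machine: Machine) -> dict:
    """Clocks each modeled unit needs for one iteration."""
    return {
        "memory": loop.memory_ops / machine.memory_ops_per_clock,
        "floating_point": loop.fp_instructions / machine.fp_instructions_per_clock,
        "issue": loop.issued_instructions / machine.issue_per_clock,
        "dependence": loop.recurrence_clocks,
    }


def cpf_bound(loop: LoopBody, machine: Machine) -> tuple:
    """Return (limiting unit, lower bound on clocks per floating-point op)."""
    clocks = unit_clocks(loop, machine)
    limiting = max(clocks, key=clocks.get)  # the busiest unit sets the bound
    return limiting, clocks[limiting] / loop.flops


# Assumed machine: 1 memory op, 1 FP instruction, 4 issued instructions per clock.
machine = Machine(memory_ops_per_clock=1.0,
                  fp_instructions_per_clock=1.0,
                  issue_per_clock=4.0)

# Assumed loop: 3 memory ops and 2 multiply-adds (4 flops) per iteration.
memory_bound_loop = LoopBody(memory_ops=3, fp_instructions=2,
                             issued_instructions=6, recurrence_clocks=0.0,
                             flops=4)
```

With these assumed numbers, `cpf_bound(memory_bound_loop, machine)` names the memory interface as the limiting unit at 0.75 CPF, above the 0.5 CPF machine peak. This is the kind of gap between peak and bound that the abstract describes. Cache misses and register spill code, which the abstract treats separately, are not modeled here.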