APPROACHING A MACHINE-APPLICATION BOUND IN DELIVERED PERFORMANCE ON SCIENTIFIC CODE

Citation
Wh. Mangionesmith et al., APPROACHING A MACHINE-APPLICATION BOUND IN DELIVERED PERFORMANCE ON SCIENTIFIC CODE, Proceedings of the IEEE, 81(8), 1993, pp. 1166-1178
Number of citations
31
Subject Categories
Engineering, Electrical & Electronic
Journal title
Proceedings of the IEEE
Journal ISSN
0018-9219
Volume
81
Issue
8
Year of publication
1993
Pages
1166 - 1178
Database
ISI
SICI code
0018-9219(1993)81:8<1166:AAMBID>2.0.ZU;2-B
Abstract
We have developed a performance bounding methodology that explains the performance of loop-dominated scientific applications on particular systems. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory interface, floating-point, instruction issue, and a "dependence unit" which is used to model the effects of performance-limiting recurrences. We propose a workload characterization, and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable. A detailed analysis and performance improvement effort for the IBM RS/6000, using the Livermore Fortran Kernels 1-12 to represent the target workload, produces an average lower bound of 1.27 clocks per floating-point operation (CPF), whereas machine peak performance is 0.5 CPF and the V2.01 Fortran compiler attains only 2.43 CPF. Code improvements in this study have achieved 1.36 CPF, increasing the harmonic mean steady-state inner loop performance to 97.6% of the MFLOPS bound. Subsequently, the V2.02 compiler achieved 1.75 CPF, and 1.60 with carefully chosen preprocessing. A goal-directed compiler with bound knowledge could produce higher performance code more efficiently and automatically. In general, achieved performance is also affected by cache misses and register spill code. Simple calibration loops are used to characterize cache performance. The register requirements are characterized as a function of the latency and bandwidth of memory and function units for application kernels that have tree-structured dependence graphs.
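The bounding idea in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification not spelled out in the abstract, that each of the four units needs some number of clocks per loop iteration and that the slowest unit sets the bound. The class names, field names, throughput figures, and operation counts below are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LoopWorkload:
    """Hypothetical per-iteration counts for one inner loop."""
    flops: int                # floating-point operations counted toward CPF
    fp_instructions: int      # FP instructions (a fused multiply-add = 1 instruction, 2 flops)
    memory_ops: int           # loads plus stores
    total_instructions: int   # all instructions issued per iteration
    recurrence_clocks: float  # clocks per iteration forced by a loop-carried dependence


@dataclass
class MachineModel:
    """Hypothetical sustained throughputs, in operations per clock."""
    fp_per_clock: float
    mem_per_clock: float
    issue_per_clock: float


def unit_clocks(w: LoopWorkload, m: MachineModel) -> dict:
    """Clocks per iteration that each modeled unit needs on its own."""
    return {
        "floating-point": w.fp_instructions / m.fp_per_clock,
        "memory interface": w.memory_ops / m.mem_per_clock,
        "instruction issue": w.total_instructions / m.issue_per_clock,
        "dependence": w.recurrence_clocks,
    }


def cpf_bound(w: LoopWorkload, m: MachineModel) -> tuple:
    """Return (limiting unit, lower bound on clocks per flop)."""
    clocks = unit_clocks(w, m)
    limiting = max(clocks, key=clocks.get)
    return limiting, clocks[limiting] / w.flops


def fraction_of_bound(bound_cpf: float, achieved_cpf: float) -> float:
    """Delivered performance as a fraction of the bound (lower CPF is faster)."""
    return bound_cpf / achieved_cpf


# Assumed machine: one FP instruction, one memory operation, four issues per clock.
machine = MachineModel(fp_per_clock=1.0, mem_per_clock=1.0, issue_per_clock=4.0)

# Assumed loop: one multiply-add, three memory operations, no recurrence.
loop = LoopWorkload(flops=2, fp_instructions=1, memory_ops=3,
                    total_instructions=5, recurrence_clocks=0.0)

limiting_unit, bound = cpf_bound(loop, machine)
print(limiting_unit, bound)

# Ratio of the two averaged CPF figures quoted in the abstract (bound vs. V2.01 compiler).
# This is a plain ratio, not the abstract's steady-state harmonic-mean percentage.
print(round(fraction_of_bound(1.27, 2.43), 3))
```

In this sketch the memory interface limits the example loop even though the floating-point unit alone would allow 0.5 CPF, which is the kind of gap between machine peak and application bound that the abstract describes.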