W. H. Mangione-Smith et al., Approaching a Machine-Application Bound in Delivered Performance on Scientific Code, Proceedings of the IEEE, 81(8), 1993, pp. 1166-1178
We have developed a performance bounding methodology that explains the
performance of loop-dominated scientific applications on particular
systems. We model the throughput of key hardware units that are common
bottlenecks in concurrent machines. The four units currently used are
the memory interface, floating-point, instruction issue, and a
"dependence unit", which is used to model the effects of
performance-limiting recurrences. We propose a workload
characterization, and derive upper bounds on the performance of
specific machine-workload pairs. Comparing delivered performance with
bounds focuses attention on areas for improvement and indicates how
much improvement might be attainable. A detailed analysis and
performance improvement effort for the IBM RS/6000, using the Livermore
Fortran Kernels 1-12 to represent the target workload, produces an
average lower bound of 1.27 clocks per floating-point operation (CPF),
whereas machine peak performance is 0.5 CPF and the V2.01 Fortran
compiler attains only 2.43 CPF. Code improvements in this study have
achieved 1.36 CPF, increasing the harmonic mean steady-state inner loop
performance to 97.6% of the MFLOPS bound. Subsequently, the V2.02
compiler achieved 1.75 CPF, and 1.60 CPF with carefully chosen
preprocessing. A goal-directed compiler with bound knowledge could
produce higher-performance code more efficiently and automatically. In
general, achieved performance is also affected by cache misses and
register spill code. Simple calibration loops are used to characterize
cache performance. The register requirements are characterized as a
function of the latency and bandwidth of memory and function units for
application kernels that have tree-structured dependence graphs.
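
The bounding idea summarized in the abstract can be illustrated with a small sketch. It assumes, as a simplification not spelled out in the abstract, that the steady-state time per loop iteration can be no less than the largest demand placed on any one of the four modeled units, and that the CPF bound is that time divided by the floating-point operations per iteration. The class names, field names, and all numeric values below are hypothetical and are not taken from the paper or from RS/6000 documentation.

```python
from dataclasses import dataclass


@dataclass
class LoopWorkload:
    """Per-iteration counts for one inner loop (hypothetical characterization)."""
    flops: int                # floating-point operations per iteration
    memory_ops: int           # loads plus stores per iteration
    fp_instructions: int      # floating-point instructions per iteration
    instructions: int         # total instructions issued per iteration
    recurrence_cycles: float  # clocks per iteration forced by loop-carried dependences


@dataclass
class Machine:
    """Peak per-clock throughput of each modeled unit (hypothetical values)."""
    memory_ops_per_clock: float
    fp_instructions_per_clock: float
    issue_per_clock: float


def unit_cycles(w: LoopWorkload, m: Machine) -> dict:
    """Minimum clocks per iteration demanded by each of the four units."""
    return {
        "memory": w.memory_ops / m.memory_ops_per_clock,
        "floating_point": w.fp_instructions / m.fp_instructions_per_clock,
        "issue": w.instructions / m.issue_per_clock,
        "dependence": w.recurrence_cycles,
    }


def cpf_bound(w: LoopWorkload, m: Machine):
    """Lower bound on clocks per floating-point operation, plus the limiting unit.

    Assumption: the busiest unit alone determines the bound.
    """
    cycles = unit_cycles(w, m)
    bottleneck = max(cycles, key=cycles.get)
    return cycles[bottleneck] / w.flops, bottleneck


def mflops(clock_mhz: float, cpf: float) -> float:
    """Convert CPF to MFLOPS for a given clock rate in MHz."""
    return clock_mhz / cpf


def percent_of_bound(measured_cpf: float, bound_cpf: float) -> float:
    """Delivered performance as a percentage of the bound (lower CPF is faster)."""
    return 100.0 * bound_cpf / measured_cpf


# Hypothetical loop: 4 flops done by 2 multiply-add instructions, 5 memory operations.
loop = LoopWorkload(flops=4, memory_ops=5, fp_instructions=2,
                    instructions=9, recurrence_cycles=0.0)
machine = Machine(memory_ops_per_clock=1.0, fp_instructions_per_clock=1.0,
                  issue_per_clock=4.0)

bound, limiting_unit = cpf_bound(loop, machine)
print(f"bound = {bound} CPF, limited by {limiting_unit}")
print(f"measured 2.5 CPF is {percent_of_bound(2.5, bound)}% of the bound")
```

In this made-up case the memory interface is the limiting unit, so the comparison points toward reducing loads and stores rather than floating-point work. This mirrors how the abstract describes using the gap between delivered performance and the bound to direct improvement effort.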