Current microprocessors incorporate techniques to aggressively exploit
instruction-level parallelism (ILP). This paper evaluates the impact of such
processors on the performance of shared-memory multiprocessors, both without
and with the latency-hiding optimization of software prefetching. Our results
show that, while ILP techniques substantially reduce CPU time in
multiprocessors, they are less effective in removing memory stall time.
Consequently, despite the inherent latency-tolerance features of ILP
processors, we find memory system performance to be a larger bottleneck and
parallel efficiencies to be generally poorer in ILP-based multiprocessors than
in previous-generation multiprocessors. The main reasons for these
deficiencies are insufficient opportunities in the applications to overlap
multiple load misses and increased contention for resources in the system. We
also find that software prefetching does not change the memory-bound nature of
most of our applications on our ILP multiprocessor, mainly due to a large
number of late prefetches and resource contention. Our results suggest the
need for additional latency-hiding or latency-reducing techniques for ILP
systems, such as software clustering of load misses and producer-initiated
communication.
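
As an illustration only, the two software techniques named above can be sketched in C. This code is not taken from the evaluated applications. The function names, the prefetch distance, and the unroll factor of 4 are hypothetical choices made for the sketch, and `__builtin_prefetch` is a GCC/Clang builtin rather than anything the paper prescribes. The first function issues a prefetch a fixed number of elements ahead of its use. The second applies unroll-and-jam to a row-major matrix traversal so that loads to four different rows, which are independent of one another, sit next to each other in the instruction stream and their misses have a chance to overlap.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements. A real value would be tuned
 * to the miss latency and the cost of one loop iteration; a distance that
 * is too short yields the "late prefetches" the abstract refers to. */
#define PREFETCH_DISTANCE 16

/* Software prefetching: request a[i + PREFETCH_DISTANCE] ahead of its use.
 * __builtin_prefetch (GCC/Clang) is only a hint; the bounds check keeps the
 * address computation inside the array. */
double sum_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        s += a[i];
    }
    return s;
}

/* Clustering of load misses via unroll-and-jam (assumed unroll factor 4).
 * m is a rows x cols matrix in row-major order; out[r] receives the sum of
 * row r. Each inner iteration touches four different rows, so up to four
 * independent misses are adjacent instead of one per pass over a row. */
void row_sums_clustered(const double *m, size_t rows, size_t cols,
                        double *out) {
    size_t r = 0;
    for (; r + 4 <= rows; r += 4) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t c = 0; c < cols; c++) {
            s0 += m[(r + 0) * cols + c];
            s1 += m[(r + 1) * cols + c];
            s2 += m[(r + 2) * cols + c];
            s3 += m[(r + 3) * cols + c];
        }
        out[r + 0] = s0;
        out[r + 1] = s1;
        out[r + 2] = s2;
        out[r + 3] = s3;
    }
    /* Remainder rows that do not fill a group of four. */
    for (; r < rows; r++) {
        double s = 0.0;
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
        out[r] = s;
    }
}
```

Both functions compute the same values as their straightforward counterparts; only the order and timing of memory requests change. Whether either transformation helps on a given machine depends on its cache line size, instruction window, and memory-system contention, so the sketch makes no performance claim.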