Relaxed consistency models have been shown to significantly outperform
sequential consistency for single-issue, statically scheduled processors
with blocking reads. However, current microprocessors aggressively
exploit instruction-level parallelism (ILP) using methods such as
multiple issue, dynamic scheduling, and non-blocking reads. Researchers
have conjectured that two techniques, hardware-controlled non-binding
prefetching and speculative loads, have the potential to equalize the
hardware performance of memory consistency models on such processors.
This paper performs the first detailed quantitative comparison of
several implementations of sequential consistency and release
consistency optimized for aggressive ILP processors. Our results
indicate that hardware prefetching and speculative loads dramatically
improve the performance of sequential consistency. However, the gap
between sequential consistency and release consistency depends on the
cache write policy and the complexity of the cache-coherence protocol
implementation. In most cases, release consistency significantly
outperforms sequential consistency, but for two applications, the use
of a write-back primary cache and a more complex cache-coherence
protocol nearly equalizes the
performance of the two models. We also observe that the existing
techniques, which require on-chip hardware modifications, enhance the
performance of release consistency only to a small extent. We propose
two new software techniques - fuzzy acquires and selective acquires -
to achieve more overlap than allowed by the previous implementations
of release consistency. To enhance methods for overlapping acquires,
we also
propose a technique to eliminate control dependences caused by an
acquire loop, using a small amount of off-chip hardware called the
synchronization buffer.
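
For readers unfamiliar with the term, an acquire loop can be sketched in
software as a spin on a synchronization flag. The sketch below is only an
illustration under stated assumptions: it uses C++11 atomics as a stand-in
for release and acquire operations, and the names payload, ready, and
run_once are hypothetical. It does not implement fuzzy acquires, selective
acquires, or the synchronization buffer.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names throughout: `payload`, `ready`, and `run_once` are
// illustrative and do not come from the paper.
int run_once() {
    int payload = 0;                 // ordinary shared data
    std::atomic<bool> ready{false};  // synchronization flag

    std::thread producer([&] {
        payload = 42;                                  // data write
        ready.store(true, std::memory_order_release);  // release
    });

    // Acquire loop: the loop branch depends on the value loaded from the
    // flag, so the accesses that follow are control dependent on it.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int observed = payload;  // ordered after the successful acquire

    producer.join();
    return observed;
}
```

Calling run_once() returns the value written before the release, since the
read of payload is ordered after the acquire that observes the flag. The
branch inside the loop is the kind of control dependence the abstract
refers to.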