As the instruction issue width of superscalar processors increases,
instruction fetch bandwidth requirements will also increase. It will
eventually become necessary to fetch multiple basic blocks per clock cycle.
Conventional instruction caches hinder this effort because long instruction
sequences are not always in contiguous cache locations. Trace caches
overcome this limitation by caching traces of the dynamic instruction
stream, so instructions that are otherwise noncontiguous appear contiguous.
In this paper, we present and evaluate a microarchitecture incorporating a
trace cache. The microarchitecture provides high instruction fetch
bandwidth with low latency by explicitly sequencing through the program at
the higher level of traces, both in terms of 1) control flow prediction and
2) instruction supply. For the SPEC95 integer benchmarks, trace-level
sequencing improves performance by 15 percent to 35 percent over an
otherwise equally sophisticated, but contiguous, multiple-block fetch
mechanism. Most of this performance improvement is due to the trace cache.
However, for one benchmark whose performance is limited by branch
mispredictions, the performance gain is almost entirely due to improved
prediction accuracy.
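
To make the idea of caching traces of the dynamic instruction stream more
concrete, the following Python sketch models a toy trace cache in software.
It is an illustration only and is not the microarchitecture evaluated in
this paper: the names TraceCache, MAX_TRACE_LEN, MAX_BRANCHES, and
demo_stream, the trace-termination limits, and the choice to index a trace
by its starting PC plus its branch outcomes are all assumptions made for
the example.

```python
from dataclasses import dataclass, field

# Assumed limits for illustration; not taken from the paper.
MAX_TRACE_LEN = 16   # maximum instructions stored in one trace
MAX_BRANCHES = 3     # maximum conditional branches embedded in one trace


@dataclass
class TraceCache:
    """Toy model: maps (start PC, branch outcomes) to a list of PCs."""
    lines: dict = field(default_factory=dict)

    def fill(self, dynamic_stream):
        """Cut a dynamic stream of (pc, is_branch, taken) tuples into traces."""
        trace, outcomes = [], []
        for pc, is_branch, taken in dynamic_stream:
            trace.append(pc)
            if is_branch:
                outcomes.append(taken)
            # End the trace when either assumed limit is reached.
            if len(trace) == MAX_TRACE_LEN or len(outcomes) == MAX_BRANCHES:
                self.lines[(trace[0], tuple(outcomes))] = list(trace)
                trace, outcomes = [], []
        if trace:  # keep the final, shorter trace as well
            self.lines[(trace[0], tuple(outcomes))] = list(trace)

    def fetch(self, start_pc, predicted_outcomes):
        """Return a whole trace in one lookup, or None on a miss."""
        return self.lines.get((start_pc, tuple(predicted_outcomes)))


# Hypothetical dynamic stream: branches at PCs 2 and 13 are taken, so the
# executed instructions are not contiguous in the static program.
demo_stream = [
    (0, False, False), (1, False, False), (2, True, True),
    (10, False, False), (11, True, False), (12, False, False),
    (13, True, True), (40, False, False), (41, False, False),
]
```

In this sketch, a single lookup with the starting PC and a set of predicted
branch outcomes returns instructions from several noncontiguous basic
blocks at once, while a lookup with different predicted outcomes misses.
Matching on the full outcome tuple is a simplification chosen for brevity.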