Wide-issue processors continue to achieve higher performance by exploiting
greater instruction-level parallelism. Dynamic techniques such as
out-of-order execution and hardware speculation have proven effective at
increasing instruction throughput. Runtime optimization promises to provide
an even higher level of performance by adaptively applying aggressive code
transformations on a larger scope. This paper presents a new hardware
mechanism for generating and deploying runtime optimized code. The mechanism
can be viewed as a filtering system that resides in the retirement stage of
the processor pipeline, accepts an instruction execution stream as input,
and produces instruction profiles and sets of linked, optimized traces as
output. The code deployment mechanism uses an extension to the branch
prediction mechanism to migrate execution into the new code without
modifying the original code. These new components do not add delay to the
execution of the program except during short bursts of reoptimization. This
technique provides a strong platform for runtime optimization because the
hot execution regions are extracted, optimized, and written to main memory
for execution and because these regions persist across context switches.
The current design of the framework supports a suite of optimizations,
including partial function inlining (even into shared libraries), code
straightening optimizations, loop unrolling, and peephole optimizations.
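
As a rough software analogy for the filtering view described above, the sketch below consumes a stream of retired branches and produces a per-branch profile plus a list of hot branches. It is only an illustration under stated assumptions, not the hardware design presented in this paper: the function name profile_retired_branches, the (branch_pc, taken) record format, and the hot_threshold parameter are all hypothetical.

```python
from collections import Counter


def profile_retired_branches(retired, hot_threshold=4):
    """Consume (branch_pc, taken) pairs in retirement order.

    Returns (profile, hot), where profile maps each branch address to
    (times_executed, times_taken) and hot is the sorted list of branch
    addresses executed at least hot_threshold times.
    The record format and threshold are illustrative assumptions.
    """
    executed = Counter()
    taken_count = Counter()
    for pc, taken in retired:
        executed[pc] += 1
        if taken:
            taken_count[pc] += 1

    profile = {pc: (n, taken_count[pc]) for pc, n in executed.items()}
    hot = sorted(pc for pc, n in executed.items() if n >= hot_threshold)
    return profile, hot


# Hypothetical retirement stream: one frequently executed branch, one cold branch.
stream = [(0x40, True)] * 5 + [(0x80, False)] * 2 + [(0x40, False)]
profile, hot = profile_retired_branches(stream)
```

In this toy stream only the branch at 0x40 crosses the threshold, so it alone would be a candidate for trace extraction; the cold branch at 0x80 is profiled but filtered out.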