This article describes the Digital Continuous Profiling Infrastructure,
a sampling-based profiling system designed to run continuously on
production systems. The system supports multiprocessors, works on
unmodified executables, and collects profiles for entire systems,
including user programs, shared libraries, and the operating system
kernel. Samples are collected at a high rate (over 5200 samples/sec,
per 333MHz processor), yet with low overhead (1-3% slowdown for most
workloads). Analysis tools supplied with the profiling system use the
sample data to produce a precise and accurate accounting, down to the
level of pipeline stalls incurred by individual instructions, of where
time is being
spent. When instructions incur stalls, the tools identify possible
reasons, such as cache misses, branch mispredictions, and functional
unit contention. The fine-grained instruction-level analysis guides
users and automated optimizers to the causes of performance problems
and provides important insights for fixing them.
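
To make the quoted figures concrete, the following minimal sketch does two
things. It derives the average gap between samples implied by the stated
rate and clock speed, and it shows how program-counter samples could be
tallied to find hot instructions. The function names, the (image, pc)
sample layout, and the sample data are assumptions made for illustration;
they are not taken from the system described above.

```python
from collections import Counter

CLOCK_HZ = 333_000_000   # 333MHz processor, as quoted in the text
SAMPLES_PER_SEC = 5200   # lower bound on the sampling rate, as quoted


def mean_cycles_between_samples(clock_hz, samples_per_sec):
    """Average number of clock cycles that elapse between two samples."""
    return clock_hz / samples_per_sec


def hottest_instructions(samples, top=3):
    """Tally samples per (image, pc) pair and return the most frequent ones.

    Each result is (key, sample_count, fraction_of_all_samples).
    """
    counts = Counter(samples)
    total = sum(counts.values())
    return [(key, n, n / total) for key, n in counts.most_common(top)]


# Hypothetical samples: (image name, program counter) pairs.
samples = [
    ("libc.so", 0x1200), ("vmunix", 0x9F40), ("libc.so", 0x1200),
    ("app", 0x0400), ("libc.so", 0x1204), ("vmunix", 0x9F40),
    ("libc.so", 0x1200),
]

if __name__ == "__main__":
    gap = mean_cycles_between_samples(CLOCK_HZ, SAMPLES_PER_SEC)
    print(f"about {gap:.0f} cycles between samples")
    for (image, pc), n, share in hottest_instructions(samples, top=2):
        print(f"{image} @ {pc:#x}: {n} samples ({share:.0%})")
```

Because the stated rate is a lower bound, the computed gap of roughly
64,000 cycles is an upper bound on the average interval between samples.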