Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and to simplify program structure. Multitasking operating systems use threads to mask communication latency, either with hardware devices or with users. Client-server applications typically use threads to simplify the complex control flow that arises when serving multiple clients. Recently, the scientific computing community has begun using threads to mask network communication latency on massively parallel architectures, allowing computation and communication to be overlapped. Lastly, some architectures implement threads in hardware and use those threads to tolerate memory latency. In general, it would be desirable if threaded programs could be written to expose the greatest possible degree of parallelism, or simply to streamline the program design. However, threads incur time and space overheads, and programmers often compromise simple designs for performance. In this paper, we show how to reduce the time and space overhead of threads using control-flow and register-liveness information inferred after compilation (see the sketch below). Our techniques work on binaries, are not specific to a particular compiler or thread library, and reduce the overall execution time of fine-grain threaded programs by approximately 15-30%. We use execution-driven analysis and an instrumented operating system to show why execution time is reduced and to indicate areas for future work.
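
To make the core idea concrete, the following is a minimal C sketch, not taken from the paper, of how register-liveness information can shrink the work done at a cooperative thread switch. The machine model (16 general registers), the names (thread_ctx_t, save_full, save_live_r12), and the specific live set are all assumptions for illustration; the point is only that a switch point where post-compilation analysis proves few registers live needs to save far fewer words than a full register-file save.

    /* Hypothetical illustration: saving only live registers at a
     * thread switch point.  All names and the live set shown here
     * are invented for the sketch, not from the paper. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t regs[16];   /* space for the full register file */
        uint16_t live_mask;  /* which slots were actually saved */
    } thread_ctx_t;

    /* Naive switch: save every register, paying full time and
     * space overhead at every switch point. */
    void save_full(thread_ctx_t *ctx, const uint64_t cur[16]) {
        memcpy(ctx->regs, cur, sizeof ctx->regs);
        ctx->live_mask = 0xFFFF;
    }

    /* Liveness-aware switch: suppose the post-compilation analysis
     * determined that only registers 1 and 2 are live at this
     * particular switch point, so only those two words are saved. */
    void save_live_r12(thread_ctx_t *ctx, const uint64_t cur[16]) {
        ctx->regs[1] = cur[1];
        ctx->regs[2] = cur[2];
        ctx->live_mask = (1u << 1) | (1u << 2);
    }

Because the paper's techniques operate on binaries, the live set at each switch point would be recovered without compiler cooperation; the sketch simply hard-codes one such analysis result to show where the time and space savings come from.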