A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley Threaded Abstract Machine. One implementation takes an Active Messages approach, maintaining a scheduling hierarchy in software in order to improve data cache performance. The other relies on the J-Machine's message queues and fast task switch, lowering control costs at the expense of data locality. Our analysis measures the costs and benefits of each approach for a variety of programs and cache configurations. The Active Messages implementation is strongest when miss penalties are high and for the finest-grained programs. The hardware-buffered implementation is strongest with direct-mapped caches, where it achieves substantially better instruction cache performance.