Recent interest in dynamic scheduling as a means of exploiting
instruction-level parallelism has drawn significant attention to the
scalability of dynamic scheduling hardware. To overcome the scalability
problems of centralized hardware schedulers, many decentralized
execution models have recently been proposed and investigated. The crux
of all these models is to split the instruction window across multiple
processing elements (PEs) that schedule instructions independently. The
decentralized execution models proposed so far can be grouped into
three categories, based on the criterion used for assigning an
instruction to a particular PE: (i) execution unit dependence based
decentralization (EDD), (ii) control dependence based decentralization
(CDD), and (iii) data dependence based decentralization (DDD). This
paper investigates the performance aspects of these three
decentralization approaches. Using a suite of important benchmarks and
realistic system parameters, we examine performance differences
resulting from the type of partitioning as well as from specific
implementation issues such as the type of PE interconnect. We found
that with a ring-type PE interconnect, the DDD approach performs best
when the number of PEs is moderate, and the CDD approach performs best
when the number of PEs is large. The currently used approach, EDD, does
not perform well for any configuration. With a realistic crossbar,
performance does not increase with the number of PEs for any of the
partitioning approaches. The results give insight into the best way to
use the transistor budget available for implementing the instruction
window.
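
To make the three assignment criteria concrete, the following is a minimal
illustrative sketch, not the models evaluated in the paper. It assumes a toy
instruction record and three hypothetical helper functions (assign_edd,
assign_cdd, assign_ddd) that map each instruction to a PE by execution unit
class, by control-flow block, and by data producer respectively. The record
fields, the modulo mapping, and the "follow the first producer" rule are all
simplifying assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    idx: int          # position in the dynamic instruction stream
    unit: str         # execution unit class, e.g. "int", "fp", "mem" (assumed labels)
    block: int        # id of the control-flow block the instruction belongs to
    srcs: tuple = ()  # indices of earlier instructions whose results it reads


def assign_edd(instrs, num_pes):
    """EDD-style sketch: instructions needing the same unit class share a PE."""
    units = sorted({i.unit for i in instrs})
    return {i.idx: units.index(i.unit) % num_pes for i in instrs}


def assign_cdd(instrs, num_pes):
    """CDD-style sketch: instructions of the same control-flow block share a PE."""
    return {i.idx: i.block % num_pes for i in instrs}


def assign_ddd(instrs, num_pes):
    """DDD-style sketch: an instruction follows its first producer's PE.

    Instructions with no producers start a new chain, placed round-robin.
    """
    pe_of = {}
    next_pe = 0
    for i in instrs:  # instrs must be in program order
        if i.srcs:
            pe_of[i.idx] = pe_of[i.srcs[0]]
        else:
            pe_of[i.idx] = next_pe % num_pes
            next_pe += 1
    return pe_of


# Hypothetical four-instruction stream used only to contrast the policies.
program = [
    Instr(0, "int", block=0),
    Instr(1, "fp", block=0, srcs=(0,)),
    Instr(2, "mem", block=1),
    Instr(3, "int", block=1, srcs=(0,)),
]

for name, policy in [("EDD", assign_edd), ("CDD", assign_cdd), ("DDD", assign_ddd)]:
    print(name, policy(program, num_pes=3))
```

In this toy stream, instruction 3 sits in the second block but reads the
result of instruction 0, so the CDD-style and DDD-style sketches place it on
different PEs. That difference in where dependent instructions land is what
makes the PE interconnect matter when comparing the approaches.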