Embedded systems require maximum performance from a processor within significant constraints on power consumption and chip cost. Using software pipelining, processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance, at the cost of substantially increased register requirements. These increased register requirements, however, make it difficult to build a high-performance embedded processor with a single, multi-ported register file while maintaining clock speed and limiting power consumption.
Some digital signal processors, such as the TI C6x, reduce the number of ports required for a register bank by partitioning the register file into multiple banks. Each disjoint subset of functional units is directly connected to one of the partitioned register banks. Each register bank, together with its associated functional units, is called a cluster. Clustering reduces the number of ports needed per bank, allowing an increased clock rate. However, execution speed can be hampered by the potential need to copy "non-local" operands among register banks to make them available to the functional unit performing an operation. The task of the compiler is both to maximize parallelism and to minimize the number of remote register accesses needed.
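The cost of a given partition can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from the paper: every operation is assigned to exactly one cluster, its result lives in that cluster's register bank, and a value consumed in a different cluster needs one copy per destination cluster. The function name `count_remote_copies`, the operation names, and the cluster numbering are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data-dependence edge: (producer operation, consumer operation).
Edge = Tuple[str, str]


def count_remote_copies(edges: List[Edge], cluster_of: Dict[str, int]) -> int:
    """Count inter-cluster copies implied by an operation-to-cluster assignment.

    Assumed model (not from the paper): a value produced in one cluster and
    consumed in another is copied once per destination cluster, no matter
    how many consumers in that cluster read it.
    """
    copies = set()
    for producer, consumer in edges:
        src = cluster_of[producer]
        dst = cluster_of[consumer]
        if src != dst:
            # One copy of this producer's value into the destination bank.
            copies.add((producer, dst))
    return len(copies)


# A small expression, (a * b) + c, with its loads made explicit.
edges = [
    ("load_a", "mul"),
    ("load_b", "mul"),
    ("mul", "add"),
    ("load_c", "add"),
]

# Everything on cluster 0: no copies, but no parallelism across clusters.
single = {"load_a": 0, "load_b": 0, "mul": 0, "add": 0, "load_c": 0}

# Work spread over two clusters: more parallelism, at the price of copies.
split = {"load_a": 0, "load_b": 1, "mul": 0, "add": 1, "load_c": 1}

print(count_remote_copies(edges, single))
print(count_remote_copies(edges, split))
```

In this toy model the single-cluster assignment needs no copies, while the two-cluster assignment trades copies for the chance to issue operations on both clusters in the same cycle. That trade-off is the tension the compiler has to balance.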
Previous work has concentrated on methods to partition virtual registers among the target architecture's clusters. In this paper, we show how high-level loop transformations can enhance the partitioning obtained by low-level schemes. In our experiments, loop transformations improved software pipelining by 27% on a machine with 2 clusters, each having 1 floating-point and 1 integer register bank and 4 functional units. We also observed a 20% improvement on a similar machine with 4 clusters of 2 functional units. In fact, by performing the described loop transformations we were able to show improvements of greater than 10% over schedules (for untransformed loops) generated with the unrealistic assumption of a single multi-ported register bank.