Loop transformations for architectures with partitioned register banks

Citation
Xl. Huang et al., Loop transformations for architectures with partitioned register banks, ACM SIGPLAN Notices, 36(8), 2001, pp. 48-55
Citations number
27
Subject Categories
Computer Science & Engineering
Journal title
ACM SIGPLAN NOTICES
ISSN journal
1523-2867
Volume
36
Issue
8
Year of publication
2001
Pages
48 - 55
Database
ISI
SICI code
1523-2867(200108)36:8<48:LTFAWP>2.0.ZU;2-J
Abstract
Embedded systems require maximum performance from a processor within significant constraints in power consumption and chip cost. Using software pipelining, processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance, at the cost of substantially increasing register requirements. These increasing register requirements, however, make it difficult to build a high-performance embedded processor with a single, multi-ported register file while maintaining clock speed and limiting power consumption.
Some digital signal processors, such as the TI C6x, reduce the number of ports required for a register bank by partitioning the register bank into multiple banks. Disjoint subsets of functional units are directly connected to one of the partitioned register banks. Each register bank, together with its associated functional units, is called a cluster. Clustering reduces the number of ports needed on a per-bank basis, allowing an increased clock rate. However, execution speed can be hampered because of the potential need to copy "non-local" operands among register banks in order to make them available to the functional unit performing an operation. The task of the compiler is to both maximize parallelism and minimize the number of remote register accesses needed.
Previous work has concentrated on methods to partition virtual registers amongst the target architecture's clusters. In this paper, we show how high-level loop transformations can enhance the partitioning obtained by low-level schemes. In our experiments, loop transformations improved software pipelining by 27% on a machine with 2 clusters, each having 1 floating-point and 1 integer register bank and 4 functional units. We also observed a 20% improvement on a similar machine with 4 clusters of 2 functional units. In fact, by performing the described loop transformations we were able to show improvements of greater than 10% over schedules (for un-transformed loops) generated with the unrealistic assumption of a single multi-ported register bank.