Loop transformations for architectures with partitioned register banks

Authors

Huang, XL; Carr, S; Sweany, P

Citation

X.L. Huang et al., Loop transformations for architectures with partitioned register banks, ACM SIGPLAN Notices, 36(8), 2001, pp. 48-55

Citations number

Subject Categories

Computer Science & Engineering

Journal title

ACM SIGPLAN NOTICES

ISSN journal

1523-2867

Volume

36

Issue

8

Year of publication

2001

Pages

48 - 55

Database

ISI

SICI code

1523-2867(200108)36:8<48:LTFAWP>2.0.ZU;2-J

Abstract

Embedded systems require maximum performance from a processor within significant constraints in power consumption and chip cost. Using software pipelining, processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance, at the cost of substantially increasing register requirements. These increasing register requirements, however, make it difficult to build a high-performance embedded processor with a single, multi-ported register file while maintaining clock speed and limiting power consumption.

Some digital signal processors, such as the TI C6x, reduce the number of ports required for a register bank by partitioning the register bank into multiple banks. Disjoint subsets of functional units are directly connected to one of the partitioned register banks. Each register bank and its associated functional units is called a cluster. Clustering reduces the number of ports needed on a per-bank basis, allowing an increased clock rate. However, execution speed can be hampered because of the potential need to copy "non-local" operands among register banks in order to make them available to the functional unit performing an operation. The task of the compiler is to both maximize parallelism and minimize the number of remote register accesses needed.

Previous work has concentrated on methods to partition virtual registers amongst the target architecture's clusters. In this paper, we show how high-level loop transformations can enhance the partitioning obtained by low-level schemes. In our experiments, loop transformations improved software pipelining by 27% on a machine with 2 clusters, each having 1 floating-point and 1 integer register bank and 4 functional units. We also observed a 20% improvement on a similar machine with 4 clusters of 2 functional units. In fact, by performing the described loop transformations we were able to show improvements of greater than 10% over schedules (for un-transformed loops) generated with the unrealistic assumption of a single multi-ported register bank.