S. Rele et al., Compact and efficient code generation through program restructuring on limited memory embedded DSPs, IEEE COMP A, 20(4), 2001, pp. 477-494
Citations number
38
Categorie Soggetti
Eletrical & Eletronics Engineeing
Journal title
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
Many embedded systems such as digital cameras, digital radios, high-resolut
ion printers, cellular phones, etc., involve a heavy use of signal processi
ng and are thus based on digital signal processors (DSPs), DSPs such as the
TMS320C2x and the DSP5600x have irregular data paths that typically result
due to application specific needs (such as chaining multiply-accumulate op
erations, etc,). Efficient code generation for such embedded DSP processors
is a challenging problem. The stringent requirements such as tight memory
constraints and fast response time result in the need for a compact and eff
icient code. In this paper, we address the problem of generating a compact
and efficient code for embedded DSP processors. Most of the DSP instruction
set architectures (ISAs) feature intrainstruction parallelism (IIP), enabl
ing individual operations to be executed in parallel by generating a comple
x instruction. A reduction in generated code size and improved performance
can be achieved by exploiting this parallelism present in such ISAs, In thi
s paper, we present a code restructuring technique to fully exploit this pa
rallelism through maximal utilization of the complex instructions present i
n the instruction set. We formulate this as a maximal benefit code restruct
uring problem, which is to derive the arrangement of statements to maximall
y exploit IIP without violating data dependencies. This problem is equivale
nt to the precedence constrained Hamiltonian path problem for directed acyc
lic graphs and the traveling salesman problem in general, both of which are
NP-hard, In this paper, me present an optimal algorithm to solve the probl
em, We have implemented this optimal algorithm in a compiler targeted to ge
nerate code for the TMS320C25 DSP. We tested our framework on a number of b
enchmarks and found that the performance of the generated code (measured in
dynamic instruction cycle counts) improves by as much as 9.9% with an aver
age of 4%. The average code-size reduction over code compiled without explo
iting parallelism is 2.9%, We also studied the effect of loop unrolling on
the available IIP. An on-chip instruction cache can be effectively utilized
by unrolling loops such that generated code fully occupies the memory, The
benefit is reduction in dynamic instruction count due to the higher number
of complex instructions generated. We found that by unrolling loop by four
to five times to fit available on-chip instruction cache, the dynamic inst
ruction counts reduce by as much as 9.9%.