Current microprocessors utilise instruction-level parallelism by means of a
deep processor pipeline and the superscalar instruction issue technique.
VLSI technology offers several solutions for aggressive exploitation of
instruction-level parallelism in future generations of microprocessors.
Technological advances will replace gate delay by on-chip wire delay as the
main obstacle to increasing chip complexity and cycle rate. The implication
for the microarchitecture is that functionally partitioned designs with
strict nearest-neighbour connections must be developed. Among the major
problems facing microprocessor designers is the application of an even
higher degree of speculation in combination with functional partitioning of
the processor, which prepares the way for exceeding the classical dataflow
limit imposed by data dependences. In this paper we survey the current
approaches to solving this problem; in particular, we analyse several new
research directions whose solutions are based on the complex uniprocessor
architecture. A
uniprocessor chip features a very aggressive superscalar design combined
with a trace cache and superspeculative techniques. Superspeculative
techniques exceed the classical dataflow limit, which states that even with
unlimited machine resources a program cannot execute any faster than the
execution of the longest dependence chain introduced by the program's data
dependences. Superspeculative processors also speculate about control
dependences. The trace cache stores the dynamic instruction traces
contiguously, and instructions are fetched from the trace cache rather than
from the instruction cache. Since a dynamic trace of instructions may
contain multiple taken branches, there is no need to fetch from multiple
targets, as would be necessary when predicting multiple branches and
fetching 16 or 32 instructions from the instruction
cache. Multiscalar and trace processors define several processing cores
that speculatively execute different parts of a sequential program in
parallel. Multiscalar processors use a compiler to partition the program
segments,
whereas a trace processor uses a trace cache to dynamically generate trace
segments for the processing cores. A datascalar processor runs the same
sequential program redundantly on several processing elements, where each
processing element has a different data set. This paper discusses and
compares the performance potential of these complex uniprocessors.
(C) 2000 Elsevier Science B.V. All rights reserved.
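
The classical dataflow limit mentioned in the abstract can be illustrated
with a small sketch. It is not taken from the paper: the instruction names,
latencies and dependence graph are hypothetical, and the function
dataflow_limit is an illustrative helper that assumes an acyclic dependence
graph. It computes the length of the longest dependence chain, and then
shows how that bound could shrink if one result were predicted correctly,
which is the kind of effect superspeculative techniques aim for.

```python
from functools import lru_cache


def dataflow_limit(latency, deps):
    """Return the length in cycles of the longest dependence chain.

    latency: dict mapping an instruction name to its latency in cycles.
    deps:    dict mapping an instruction name to the instructions whose
             results it consumes (assumed to form an acyclic graph).
    """

    @lru_cache(maxsize=None)
    def finish(instr):
        # An instruction can start only after all of its producers finish.
        start = max((finish(p) for p in deps.get(instr, ())), default=0)
        return start + latency[instr]

    return max((finish(i) for i in latency), default=0)


# Hypothetical six-instruction fragment (names and latencies are made up).
latency = {"i1": 1, "i2": 1, "i3": 3, "i4": 1, "i5": 1, "i6": 1}
deps = {"i3": ["i1", "i2"], "i4": ["i3"], "i5": ["i1"], "i6": ["i4", "i5"]}

# Even with unlimited resources the chain i1 -> i3 -> i4 -> i6 takes 6 cycles.
limit = dataflow_limit(latency, deps)
print(limit)  # 6

# Assumption: the result of i3 is predicted correctly, so its consumers no
# longer wait for it. Dropping those edges models the shortened chain.
deps_predicted = {i: [p for p in ps if p != "i3"] for i, ps in deps.items()}
limit_predicted = dataflow_limit(latency, deps_predicted)
print(limit_predicted)  # 4
```

In this sketch the bound drops from 6 to 4 cycles only because the
prediction is assumed to be correct; a real processor would also have to
verify the predicted value and recover from mispredictions.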