The semiconductor industry roadmap projects that advances in VLSI
technology will permit more than one billion transistors on a chip by the
year 2010. The MIT Raw microprocessor is a proposed architecture that
strives to exploit these chip-level resources by implementing thousands of
tiles, each comprising a processing element and a small amount of memory,
coupled by a static two-dimensional interconnect. A compiler partitions
fine-grain instruction-level parallelism across the tiles and statically
schedules inter-tile communication over the interconnect. Because Raw
microprocessors fully expose their internal hardware structure to the
software, each can be viewed as a gigantic FPGA with coarse-grained tiles
in which software orchestrates communication over static interconnections.
One open challenge in Raw architectures is to determine their optimal
grain size and balance. The grain size is the area of each tile, and the
balance is the proportion of area in each tile devoted to memory,
processing, communication, and off-chip global I/O. If the total chip area
is fixed, higher processing power per tile requires larger tiles and hence
reduces the total number of tiles on the chip. This paper presents
SimpleFit, a novel analytical framework that designers can use to reason
about the design space of Raw microprocessors. Our model is also
generalizable to multiprocessors on a chip. Based on an architectural
model, an application model, and a VLSI cost analysis, the framework
computes the performance of applications and uses an optimization process
to identify designs that will execute these applications most
cost-effectively. Although the optimal machine configurations obtained
vary for different applications, problem sizes, and budgets, the general
trends for various applications are similar. Accordingly, for the
applications studied, assuming a one-billion-logic-transistor-equivalent
area, we recommend building a Raw chip with approximately 1,000 tiles,
30 words/cycle global I/O, 20 Kbytes of local memory per tile, three to
four words/cycle local communication bandwidth, and single-issue
processors. This configuration will give performance near the global
optimum for most applications.
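
The fixed-area trade-off between grain size and tile count can be made concrete with a few lines of arithmetic. The Python sketch below is only an illustration and is not the SimpleFit model: the function names are hypothetical, and it assumes the entire area budget is divided evenly among tiles with no non-tile overhead. The budget, tile count, and per-tile memory figures are the ones quoted above.

```python
# Illustrative arithmetic only; this is NOT the SimpleFit framework.
# Assumption: the whole area budget is split evenly among identical tiles.

CHIP_BUDGET = 1_000_000_000  # logic-transistor-equivalent area (from the text)


def tiles_on_chip(tile_area: int, chip_budget: int = CHIP_BUDGET) -> int:
    """Whole tiles that fit when each tile costs `tile_area` units."""
    if tile_area <= 0:
        raise ValueError("tile_area must be positive")
    return chip_budget // tile_area


def area_per_tile(num_tiles: int, chip_budget: int = CHIP_BUDGET) -> int:
    """Grain size (area per tile) implied by a given tile count."""
    if num_tiles <= 0:
        raise ValueError("num_tiles must be positive")
    return chip_budget // num_tiles


def total_local_memory_kbytes(num_tiles: int, kbytes_per_tile: int) -> int:
    """Aggregate on-chip local memory across all tiles, in Kbytes."""
    return num_tiles * kbytes_per_tile


if __name__ == "__main__":
    # Recommended configuration: about 1,000 tiles, 20 Kbytes per tile.
    print(area_per_tile(1_000))
    print(total_local_memory_kbytes(1_000, 20))
    # Doubling the grain size halves the number of tiles.
    print(tiles_on_chip(2 * area_per_tile(1_000)))
```

Under these assumptions, the recommended configuration corresponds to a grain size of roughly one million logic-transistor equivalents per tile and about 20,000 Kbytes of local memory in aggregate. Doubling the grain size would leave room for only about 500 tiles.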