This paper presents a simulation-based performance prediction framework for
large-scale, data-intensive applications on large-scale machines. The
framework consists of two components: application emulators and a suite of
simulators. Application emulators provide a parameterized model of data access
and computation patterns of the applications and enable changing critical
application components (input data partitioning, data declustering,
processing structure, etc.). The suite of simulators executes quickly on a
high-performance workstation to allow performance prediction of large-scale
parallel machine configurations. The key to efficient simulation of very
large configurations is to elide the majority of low-level hardware events
while preserving data dependencies and distributions. The authors evaluate
their performance prediction tool using a set of three data-intensive
applications.
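
To make the idea of coarse-grained simulation concrete, the following is a minimal illustrative sketch in Python. It is not the authors' tool. The names `EmulatorParams` and `predict_runtime`, the round-robin declustering and work assignment, and the per-byte cost model are all assumptions introduced here. The sketch models one event per chunk read and one per chunk computation, rather than individual hardware events, while still enforcing that a chunk is processed only after it has been read from its disk.

```python
from dataclasses import dataclass


@dataclass
class EmulatorParams:
    """Hypothetical knobs an application emulator might expose (assumed names)."""
    num_chunks: int            # number of input data chunks
    chunk_bytes: int           # size of each chunk in bytes
    num_disks: int             # disks the chunks are declustered over
    num_procs: int             # processors that consume the chunks
    compute_s_per_byte: float  # assumed processing cost per byte, in seconds


def predict_runtime(p: EmulatorParams, disk_bw: float) -> float:
    """Estimate total runtime in seconds with chunk-level events only.

    disk_bw is the assumed sustained bandwidth of one disk in bytes/second.
    """
    read_s = p.chunk_bytes / disk_bw
    compute_s = p.chunk_bytes * p.compute_s_per_byte
    disk_free = [0.0] * p.num_disks  # time at which each disk becomes idle
    proc_free = [0.0] * p.num_procs  # time at which each processor becomes idle
    for chunk in range(p.num_chunks):
        disk = chunk % p.num_disks   # assumed round-robin declustering
        proc = chunk % p.num_procs   # assumed round-robin work assignment
        # Each disk serves its chunks one after another.
        disk_free[disk] += read_s
        chunk_ready = disk_free[disk]
        # Data dependency: compute starts only once the chunk has been read
        # and the assigned processor is free.
        start = max(chunk_ready, proc_free[proc])
        proc_free[proc] = start + compute_s
    return max(proc_free) if p.num_chunks > 0 else 0.0
```

Changing `num_disks` or `num_procs` in this sketch mimics, in a very simplified way, how varying the data declustering or the machine configuration would change the predicted runtime without simulating any low-level hardware detail.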