In this paper, we describe an efficient and scalable implementation of
the NAS Parallel Benchmark BT suitable for distributed memory systems
such as the IBM Scalable POWERparallel Systems(R). After describing
the parallelization and data partitioning methods used, we outline some
of the optimization steps used to realize good performance on individual
processors and to reduce the communication overheads on the IBM
SP1(TM) and SP2(TM) systems. We present performance results on up to 128
nodes of the SP1, and on the SP2 with wide nodes. We report
performance on the standard Class A and Class B problem sets. To show the
scalability of our parallelization methods, we present the performance
on two additional data sets.