In this paper we consider the CRAY APP, the Attached Parallel Processor
of the CRAY S-MP, which consists of seven buses, each supporting up to
12 processing elements. Processing elements on different buses can
communicate simultaneously with the shared main memory, but processing
elements sharing the same bus cannot, since only one processing element
per bus can access memory at a given time. Applications with a high
level of data reuse or a high computational intensity, and applications
that are highly parallel, are well suited to the APP. An example of
such an algorithm is matrix-matrix multiplication.
We illustrate how this restriction on data traffic influences
performance, and we discuss a performance model of the bus architecture
that considers changes in processor speed, data traffic speed and cache
contents. Furthermore, two different algorithms for Cholesky
factorization are discussed: a block left-looking algorithm and a block
right-looking algorithm. The maximum achievable speed on the CRAY APP is
mainly
determined by the performance of the matrix-matrix multiplication.
Parallelism is applied explicitly over the blocks, which makes it
possible to concatenate different block operations in cache. The
results obtained on CWI's APP (a machine with 28 processing elements)
indicate how block algorithms can be parallelized on machines with
hundreds or thousands of processors.
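
To make the difference between the two Cholesky variants concrete, the following is a minimal serial sketch in plain Python. It is not the APP implementation described in this paper: the function names (factor_diag, solve_panel, update, block_cholesky), the list-of-lists storage and the block size parameter nb are all assumptions made for illustration, and the parallel distribution of blocks over processing elements is not modelled. The only difference between the variants is when the matrix-matrix update is performed: the left-looking variant applies all earlier block columns to the current block column just before factoring it, while the right-looking variant applies the freshly computed block column to the whole trailing matrix.

```python
import math


def factor_diag(a, k, ke):
    """Unblocked Cholesky of the diagonal block a[k:ke][k:ke], in place."""
    for j in range(k, ke):
        d = a[j][j] - sum(a[j][p] ** 2 for p in range(k, j))
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, ke):
            s = a[i][j] - sum(a[i][p] * a[j][p] for p in range(k, j))
            a[i][j] = s / a[j][j]


def solve_panel(a, k, ke, n):
    """Triangular solve for the blocks below the diagonal block, in place."""
    for i in range(ke, n):
        for j in range(k, ke):
            s = a[i][j] - sum(a[i][p] * a[j][p] for p in range(k, j))
            a[i][j] = s / a[j][j]


def update(a, rows, cols, ps):
    """Matrix-matrix update a[i][j] -= sum_p a[i][p] * a[j][p] (lower part only)."""
    for i in rows:
        for j in cols:
            if j <= i:
                a[i][j] -= sum(a[i][p] * a[j][p] for p in ps)


def block_cholesky(a, nb, variant="right"):
    """Return the lower Cholesky factor of the SPD matrix a, using block size nb.

    variant is "left" (block left-looking) or "right" (block right-looking).
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        if variant == "left":
            # Apply all previously computed block columns to block column k.
            update(a, range(k, n), range(k, ke), range(0, k))
        factor_diag(a, k, ke)
        solve_panel(a, k, ke, n)
        if variant == "right":
            # Apply block column k to the whole trailing submatrix.
            update(a, range(ke, n), range(ke, n), range(k, ke))
    # Keep only the lower triangle; the upper part was never updated.
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    n = 5
    # Test matrix with entries min(i, j) + 1, which is symmetric positive definite.
    a = [[float(min(i, j) + 1) for j in range(n)] for i in range(n)]
    for variant in ("left", "right"):
        print(variant, block_cholesky(a, nb=2, variant=variant))
```

For the test matrix used above, the exact factor is the lower triangular matrix of ones, so both variants can be checked by eye. In both variants almost all arithmetic sits in update, which is a matrix-matrix multiplication; this is the operation that the abstract identifies as determining the maximum achievable speed.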