M. Kandemir et al., Compiler algorithms for optimizing locality and parallelism on shared and distributed-memory machines, Journal of Parallel and Distributed Computing, 60(8), 2000, pp. 924-965
Distributed-memory message-passing machines deliver scalable performance but are difficult to program. Shared-memory machines, on the other hand, are easier to program, but obtaining scalable performance with a large number of processors is difficult. Recently, scalable machines based on logically shared, physically distributed memory have been designed and implemented. While some performance issues, such as parallelism and locality, are common to different parallel architectures, issues such as data distribution are unique to specific architectures. One of the most important challenges compiler writers face is the design of compilation techniques that work well on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests so that, where possible, the iterations of the outermost loops can be run in parallel across processors; (2) optimizes memory locality by carefully distributing each array across processors; (3) optimizes interprocessor communication using message vectorization whenever possible; and (4) optimizes cache locality by assigning an appropriate storage layout to each array and by transforming the iteration space. Depending on the machine architecture, some or all of these steps can be applied in a unified framework. We report empirical results on an SGI Origin 2000 distributed-shared-memory multiprocessor and an IBM SP-2 distributed-memory message-passing machine to validate the effectiveness of our approach. (C) 2000 Academic Press.
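Step (1) of the abstract, running the outermost loop iterations in parallel, is commonly achieved by loop permutation: when the original nest carries a dependence on its outer loop but not on an inner one, interchanging the loops lets the new outer loop be partitioned across processors. The sketch below is an illustrative example of that transformation only, not code from the paper; the array size N and the OpenMP annotation are assumptions added for concreteness.

/* Illustrative sketch (not from the paper): loop permutation to expose
 * outer-loop parallelism, in the spirit of step (1) of the abstract.
 *
 * Original nest: the i-loop carries a flow dependence (a[i][j] reads
 * a[i-1][j]), so its iterations cannot run in parallel:
 *
 *   for (i = 1; i < N; i++)
 *     for (j = 0; j < N; j++)
 *       a[i][j] = a[i-1][j] + b[i][j];
 *
 * After interchanging i and j, the outer j-loop carries no dependence,
 * so its iterations can be distributed across processors.
 */
#include <stdio.h>
#define N 1024                          /* assumed problem size */

static double a[N][N], b[N][N];

int main(void)
{
    /* Transformed nest: j is now outermost and fully parallel. */
    #pragma omp parallel for            /* each processor takes a block of columns */
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)     /* the dependence is now carried by the inner loop */
            a[i][j] = a[i-1][j] + b[i][j];

    printf("%f\n", a[N - 1][0]);
    return 0;
}

Under such a column-wise partitioning, step (4) of the abstract would presumably favor a column-major storage layout for a, so that each processor's block of columns is contiguous in memory, and message vectorization (step 3) would hoist any boundary communication out of the inner loop so it can be sent as one block transfer per neighbor rather than many element-sized messages.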