This paper describes parallel matrix transpose algorithms on distributed-memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
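The counting argument in the abstract (GCD groups of processors, LCM/GCD communication steps) can be illustrated with a small sketch. This is not the paper's implementation; the function name `transpose_schedule` and the grouping of processors by the residue (p - q) mod GCD are assumptions chosen to mirror the wrapped-diagonal grouping described above.

```python
from math import gcd

def transpose_schedule(P, Q):
    """Illustrative sketch: for a P x Q processor template, derive the
    GCD-based processor grouping and the LCM/GCD step count mentioned
    in the abstract. Grouping by (p - q) mod GCD is an assumption."""
    g = gcd(P, Q)
    lcm = P * Q // g          # least common multiple of P and Q
    steps = lcm // g          # number of transpose communication steps
    # Processors are divided into g groups; when g == 1 (P and Q
    # relatively prime) there is a single group and the transpose
    # reduces to a complete exchange over all LCM steps.
    groups = [
        [(p, q) for p in range(P) for q in range(Q) if (p - q) % g == r]
        for r in range(g)
    ]
    return g, steps, groups

# Example: a 4 x 6 template has GCD = 2, LCM = 12, so LCM/GCD = 6 steps,
# with the 24 processors split into 2 groups that communicate concurrently.
g, steps, groups = transpose_schedule(4, 6)
```

With relatively prime dimensions, e.g. `transpose_schedule(2, 3)`, the sketch yields one group of all 6 processors and LCM/GCD = 6 steps, matching the complete-exchange case.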