PARALLEL MATRIX TRANSPOSE ALGORITHMS ON DISTRIBUTED-MEMORY CONCURRENT COMPUTERS

Citation
Jy. Choi et al., PARALLEL MATRIX TRANSPOSE ALGORITHMS ON DISTRIBUTED-MEMORY CONCURRENT COMPUTERS, Parallel Computing, 21(9), 1995, pp. 1387-1405
Citations number
14
Subject categories
Computer Sciences; Computer Science Theory & Methods
Journal title
Parallel Computing
ISSN journal
0167-8191
Volume
21
Issue
9
Year of publication
1995
Pages
1387 - 1405
Database
ISI
SICI code
0167-8191(1995)21:9<1387:PMTAOD>2.0.ZU;2-L
Abstract
This paper describes parallel matrix transpose algorithms on distributed-memory concurrent processors. We assume that the matrix is distributed over a P x Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A . B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T . B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
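The step count described in the abstract can be sketched with a short illustration (not the authors' code; the function name is ours): for a P x Q processor template, the transpose completes in LCM(P, Q)/GCD(P, Q) communication steps, which degenerates to P*Q steps of complete exchange when P and Q are relatively prime.

```python
# Illustrative sketch of the step count stated in the abstract.
# Assumption: transpose_steps is our own helper name, not from PUMMA.
from math import gcd


def transpose_steps(P, Q):
    """Communication steps for the transpose: LCM(P, Q) / GCD(P, Q)."""
    g = gcd(P, Q)
    lcm = P * Q // g
    return lcm // g


# P and Q relatively prime: gcd = 1, so lcm/gcd = P*Q
# and the algorithm reduces to a complete exchange.
print(transpose_steps(3, 5))   # 15 steps

# P and Q share a factor: processors split into gcd(P, Q) groups
# whose exchanges overlap, reducing the step count.
print(transpose_steps(4, 6))   # gcd = 2, lcm = 12 -> 6 steps
```

For a square template (P = Q), the formula gives a single step, matching the intuition that each processor only exchanges with its transpose-partner.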