This paper describes parallel matrix transpose algorithms on distributed-memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
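The counting argument in the abstract (GCD groups of processors, LCM/GCD communication steps) can be illustrated with a small sketch. This is not the paper's implementation; the function name `transpose_schedule` and the grouping of processors by the residue (p - q) mod GCD are assumptions chosen to mirror the wrapped-diagonal grouping described above.

```python
from math import gcd

def transpose_schedule(P, Q):
    """Illustrative sketch: for a P x Q processor template, derive the
    GCD-based processor grouping and the LCM/GCD step count mentioned
    in the abstract. Grouping by (p - q) mod GCD is an assumption."""
    g = gcd(P, Q)
    lcm = P * Q // g          # least common multiple of P and Q
    steps = lcm // g          # number of transpose communication steps
    # Processors are divided into g groups; when g == 1 (P and Q
    # relatively prime) there is a single group and the transpose
    # reduces to a complete exchange over all LCM steps.
    groups = [
        [(p, q) for p in range(P) for q in range(Q) if (p - q) % g == r]
        for r in range(g)
    ]
    return g, steps, groups

# Example: a 4 x 6 template has GCD = 2, LCM = 12, so LCM/GCD = 6 steps,
# with the 24 processors split into 2 groups that communicate concurrently.
g, steps, groups = transpose_schedule(4, 6)
```

With relatively prime dimensions, e.g. `transpose_schedule(2, 3)`, the sketch yields one group of all 6 processors and LCM/GCD = 6 steps, matching the complete-exchange case.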