E. Elmroth et Fg. Gustavson, Applying recursion to serial and parallel QR factorization leads to betterperformance, IBM J RES, 44(4), 2000, pp. 605-624
We present new recursive serial and parallel algorithms for QR factorizatio
n of an m by n matrix. They improve performance. The recursion leads to an
automatic variable blocking, and it also replaces a Level 2 part in a stand
ard block algorithm with Level 3 operations. However, there are significant
additional costs for creating and performing the updates, which prohibit t
he efficient use of the recursion for large n. We present a quantitative an
alysis of these extra costs. This analysis leads us to introduce a hybrid r
ecursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20
% for large square matrices and up to almost a factor of 3 for tall thin ma
trices. Uniprocessor performance results are presented for two IBM RS/6000(
R) SP nodes-a 120-MHz IBM POWER2 node and one processor of a four-way 332-M
Hz IBM PowerPC(R) 604e SMP node. The hybrid recursive algorithm reaches mor
e than 90% of the theoretical peak performance of the POWER2 node, Compared
to standard block algorithms, the recursive approach also shows a signific
ant advantage in the automatic tuning obtained from its automatic variable
blocking. A successful parallel implementation on a four-way 332-MHz IBM PP
C604e SMP node based on dynamic load balancing is presented. For two, three
, and four processors it shows speedups of up to 1.97, 2.99, and 3.97.