We present a novel practical algorithm for Cholesky factorization when the
matrix is stored in packed format by combining blocking and recursion. The
algorithm simultaneously obtains Level 3 performance, conserves about half
the storage, and avoids the production of Level 3 BLAB for packed format. W
e use recursive packed format, which was first described by Andersen et al.
[1]. Our algorithm uses only DGEMM and Level 3 kernel routines; it first t
ransforms standard packed format to packed recursive lower row format. Our
new algorithm outperforms the Level 3 LAPACK routine DPOTRF even when we in
clude the cost of data transformation. (This is true for three IBM platform
s-the POWER3, the POWER2, and the PowerPC 604e.) For large matrices, blocki
ng is not required for acceptable Level 3 performance. However, for small m
atrices the overhead of pure recursion and/or data transformation is too hi
gh. We analyze these costs analytically and provide detailed cost estimates
. We show that blocking combined with recursion reduces all overheads to a
tiny, acceptable level. However, a new problem of nonlinear addressing aris
es. We use two-dimensional mappings (tables) or data copying to overcome th
e high costs of directly computing addresses that are nonlinear functions o
f i and j.