Static symbolic factorization coupled with supernode partitioning and
asynchronous computation scheduling can achieve high gigaflop rates for
parallel sparse LU factorization with partial pivoting. This paper studies
properties of elimination forests and uses them to optimize supernode
partitioning/amalgamation and execution scheduling. It also proposes
supernodal matrix multiplication to speed up kernel computation by retaining
the BLAS-3 level efficiency and avoiding unnecessary arithmetic operations.
The experiments show that our new design with proper space optimization,
called S+, improves our previous solution substantially and can achieve up to
10 GFLOPS on 128 Cray T3E 450MHz nodes.