Y. He and C. H. Q. Ding, Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications, J SUPERCOMP, 18(3), 2001, pp. 259-277
Numerical reproducibility and stability of large-scale scientific
simulations, especially climate modeling, on distributed-memory parallel
computers are becoming critical issues. In particular, global summation of
distributed arrays is most susceptible to rounding errors, and their
propagation and accumulation cause uncertainty in final simulation results.
We analyzed several accurate summation methods and found that two methods
are particularly effective at improving (ensuring) reproducibility and
stability: Kahan's self-compensated summation and Bailey's double-double
precision summation. We provide an MPI operator MPI_SUMDD to work with MPI
collective operations to ensure a scalable implementation on large numbers
of processors. The final methods are particularly simple to adopt in
practical codes, not only for global summations but also for vector-vector
dot products and matrix-vector or matrix-matrix operations.
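
The two summation methods named in the abstract can be illustrated with a
minimal Python sketch. It is not the authors' code: the function names
(naive_sum, kahan_sum, two_sum, dd_add, dd_partial, dd_sum) are hypothetical,
the double-double addition is built on Knuth's error-free two-sum as an
assumed building block, and MPI_SUMDD itself is not reproduced. Combining
per-chunk (hi, lo) partial sums with dd_add only mimics what a reduction
operator of that kind would do across processors.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for x in values:
        total += x
    return total

def kahan_sum(values):
    """Kahan's compensated summation: carry the rounding error forward."""
    total = 0.0
    comp = 0.0  # estimate of the low-order bits lost so far
    for x in values:
        y = x - comp            # apply the correction to the next term
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost
        total = t
    return total

def two_sum(a, b):
    """Knuth's two-sum: s = fl(a + b) and e such that a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y):
    """Add two double-double values, each stored as a (hi, lo) pair."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize into (hi, lo)

def dd_partial(values):
    """Double-double partial sum of a local array, returned as (hi, lo)."""
    acc = (0.0, 0.0)
    for x in values:
        acc = dd_add(acc, (x, 0.0))
    return acc

def dd_sum(values):
    """Double-double summation rounded back to a single double."""
    return dd_partial(values)[0]

# The same four numbers in two different orders (assumed test data).
a = [1e16, 1.0, -1e16, 1.0]
b = [1e16, -1e16, 1.0, 1.0]
print("naive        :", naive_sum(a), naive_sum(b))
print("double-double:", dd_sum(a), dd_sum(b))

# Two "processors" each hold half of a; their partial sums are then combined.
combined = dd_add(dd_partial(a[:2]), dd_partial(a[2:]))
print("combined     :", combined[0])

# Kahan summation on many terms of similar size, against math.fsum.
tenths = [0.1] * 100000
print("kahan vs fsum:", kahan_sum(tenths), math.fsum(tenths))
```

Kahan's method suits long sums of terms of comparable magnitude. The
double-double form also survives the large cancellations in the small
example, which is why the order of the terms no longer changes the result
there.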