We present implementation and performance issues of a data parallel
version of the National Center for Atmospheric Research (NCAR)
Community Climate Model (CCM2). We describe automatic conversion tools
used to aid in converting a production code written for a traditional
vector architecture to data parallel code suitable for the Thinking
Machines Corporation CM-5. We also describe the 3-D transposition
method used to parallelize the spherical harmonic transforms in CCM2.
This method employs dynamic data mapping techniques to improve data
locality and parallel efficiency of these computations. We present
performance data for the 3-D transposition method on the CM-5 for
machine sizes up to 512 processors. We conclude that the parallel
performance of the 3-D transposition method is adversely affected on
the CM-5 by short vector lengths and array padding. We also find that
the CM-5 spherical harmonic transforms spend about 70% of their
execution time in communication. We detail a transposition-based data
parallel implementation of the semi-Lagrangian Transport (SLT)
algorithm used in CCM2. We analyze two approaches to parallelizing the
SLT, called the departure point and arrival point based methods, and
develop a performance model for choosing between them. We present SLT
performance data showing that the localized horizontal interpolation
in the SLT takes 70% of the time, while the data remapping itself
requires only approximately 16%. We discuss the importance of scalable
I/O to CCM2 and present the I/O rates measured on the CM-5. We compare
the performance of the data parallel version of CCM2 on a 32-processor
CM-5 with the optimized vector code running on a single-processor Cray
Y-MP, and show that the CM-5 code is 75% faster. We also give the
overall performance of CCM2 running at higher resolutions on different
numbers of CM-5 processors. We conclude by discussing the significance
of these results and their implications for data parallel climate
models.