EFDC+ v10.2 MPI – Benchmark Test
To assess the performance of the Message Passing Interface (MPI) implementation in Environmental Fluid Dynamics Code Plus (EFDC+), a detailed model of the Chesapeake Bay was created (see Figure 1). The Chesapeake Bay watershed is located on the east coast of the United States and drains more than 166,530 sq. km spread across five states. The bay itself has a surface area of more than 14,700 sq. km. For more details about the Chesapeake Bay, its drainage area, and other related models, please visit the Chesapeake Community Modeling Program website.
The hydrodynamic model of the Chesapeake Bay used for the benchmark test had about 204,000 curvilinear horizontal cells with an average dimension of 256 m by 270 m. With four vertical layers, the model contained more than 800,000 computational cells. The model simulated salinity and temperature along with the hydrodynamic processes, and the simulation period was 2011-07-01 to 2011-07-05. The dynamic time stepping option was employed with a minimum time step of 0.05 seconds and a maximum time step of 10 seconds.
To compare the Open Multi-Processing (OMP) and MPI implementations of EFDC+, the Chesapeake Bay model was run with both implementations on the same computer, using equivalent computational resources. A comparison of the model run times as a function of the number of cores utilized is shown in Figure 2. The reported run times include writing the full model output to disk at an hourly interval. This is a typical output frequency, so the results give users a sense of the performance they could expect when using EFDC+ for everyday analysis.
As is evident in Figure 2, the MPI version of EFDC+ is faster than the OMP version in all the cases analyzed. (Note: Both the MPI and OMP versions of EFDC+ offer significant run time improvements compared to running on a single core, which took about 3.2 hours to complete.) A further trend is that the OMP version’s run times decrease with additional cores, but the improvements are relatively minor, especially beyond 16 cores. With the MPI version, however, there is a noticeable speedup as the number of cores is increased. This is highlighted more clearly by comparing each run time with that of the single-core run, which represents the expected performance of legacy versions of EFDC not developed by DSI.
For the same model runs shown in Figure 2, Figure 3 presents a similar plot, except that in each case the speedup is reported. The speedup, S, is defined as
S=\frac{T_1}{T_{C}}
where T_1 is the model run time using a single core and T_C is the run time of a model run using C cores.
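As a simple illustration (the parallel run time here is chosen for easy arithmetic and is not taken from Figure 2), reducing the single-core run time of about 3.2 hours to 0.2 hours on multiple cores would correspond to a speedup of
S=\frac{3.2\ \text{h}}{0.2\ \text{h}}=16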
The OMP version is only about 2.5 times faster than a single-core run, and its speedup increases only slightly with the number of cores used, whereas the MPI version is up to about 17 times faster, with the speedup increasing substantially with the number of cores used (Figure 3). The run time improvements with MPI do not increase linearly with the number of cores, due to increased communication and synchronization between processes and the limit described by Amdahl’s Law. For those new to parallel computing, Amdahl’s Law says that for a large number of cores, C, the speedup does not scale with C but instead asymptotes to 1/w, where w is the fraction of the computation that is not run in parallel (Amdahl, 2013); the standard form of this relationship is given below. Overall, the MPI implementation offers a significant improvement in run time compared to the OMP version of EFDC+.
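In its standard form, Amdahl’s Law gives the speedup on C cores as
S(C)=\frac{1}{w+\frac{1-w}{C}}
so that as C becomes large, S(C) approaches 1/w. For example, if 5% of the computation is serial (w = 0.05), the speedup can never exceed 20 regardless of how many cores are used.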
Frequently, MPI implementations of this nature require the user to pre-process the model to specify the sub-domain sizes. However, all the sub-domain specifications for the runs shown here were generated automatically by EFDC_Explorer; the only additional step required of the user is to specify the maximum number of sub-domains to generate prior to executing EFDC+. This is particularly useful for models such as the Chesapeake Bay model, which contains a large number of grid cells and thus provides a good use case for the domain decomposition approach. There are, however, limits to the parallelization efficacy of the domain decomposition approach, due primarily to increased communication overhead and unavoidable serialization in parts of the code.
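To illustrate the general idea only (this sketch is not the decomposition algorithm used by EFDC_Explorer, and the function name and grid size below are hypothetical), a naive block-wise domain decomposition might split a grid’s row indices among sub-domains as follows:

def decompose_rows(n_rows, n_subdomains):
    """Split n_rows grid rows into n_subdomains contiguous blocks.

    Returns a list of (start, end) index ranges (end exclusive). A real
    decomposition would also balance the number of active (wet) cells per
    block and minimize the shared boundaries that drive MPI communication.
    """
    base, extra = divmod(n_rows, n_subdomains)
    ranges = []
    start = 0
    for rank in range(n_subdomains):
        # The first 'extra' blocks each take one additional row.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: divide a 400-row grid among 8 sub-domains (one per MPI process).
print(decompose_rows(400, 8))

Each MPI process would then advance its own block and exchange boundary (ghost-cell) values with its neighbours at every time step; that exchange, together with the serial sections of the code, is the source of the communication overhead and serialization limits noted above.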
DSI will release a white paper in the coming weeks describing the architecture and process behind the MPI development.