[FLASH-USERS] MPI hangs in send_block_data
Yi-Hao Chen
ychen at astro.wisc.edu
Thu Jul 18 12:51:57 EDT 2019
Hi Dean,
Thanks for the tips. I did try reducing the maxblocks.
For reference, I have been running simulations on Stampede2 Skylake nodes with maxblocks set to 800. The actual number of blocks per process has mostly been between 200 and 600, and from top output on the compute nodes the memory usage is usually less than 50%. To be specific, I use 48 MPI tasks per node and each task uses 1-2 GB of RAM, while each Skylake node has 192 GB of RAM in total. So it does not seem plausible to me that this is caused by tight memory. For debugging runs I need to set a larger maxblocks=1200 so that I can use fewer nodes.
That being said, I did have some cases in which reducing maxblocks helped. I tried maxblocks of 500 and 300, and two of the simulations could get through the initial deadlock, but a few others still did not work. My current strategy is trial and error: changing the number of nodes or maxblocks and hoping it will work eventually.
On the other hand, if I understand correctly, a smaller maxblocks makes the mpi_amr_redist_blk routine go through more iterations, because less allocated memory is available to store the received blocks, so fewer point-to-point communications happen at the same time. Maybe that helps if this is a communication problem.
Thank you for your insight. I really hope we can find out the root cause of this problem soon.
Best,
Yi-Hao
On Wed, Jul 17, 2019 at 6:38 PM Dean Townsley <Dean.M.Townsley at ua.edu> wrote:
Hi Yi-Hao,
I'm not sure if it is helpful, but it is only hinted at in the references that you cite, so I figured I'd say it...
I have, in the past, experienced issues like this when pushing close to the amount of available memory. It appears to be possible for paramesh to fail in odd ways when memory is tight, i.e. when there "should" be enough memory but only a modest amount of margin. Sometimes just running on more processors is enough, but sometimes a decrease in maxblocks is also required. You may have tried this, but I figured I would mention it.
It does seem that if the recv is succeeding but the corresponding ssend just hangs indefinitely, you have confirmed that this is not a bug in FLASH; I believe that behavior violates the MPI specification.
As to the question of whether synchronous sends should be used, I think that is probably a deep discussion about portability and error reproducibility for paramesh that I'm certainly not qualified to get into. From what I can tell from the MPI specs, this is not just about whether a synchronous send is algorithmically necessary, but about management of message buffering. Nominally, use of ssend is a request for the system not to buffer, which forestalls (possibly vast) differences between the buffering behavior of different systems.
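To illustrate with a minimal standalone sketch (a toy example of my own, not PARAMESH code): MPI_SEND is allowed to return as soon as the message has been buffered locally, whereas MPI_SSEND returns only once the matching receive has started, so its completion never depends on how much buffer space a particular system provides.
~~~~
program ssend_vs_send
  use mpi
  implicit none
  integer :: rank, ierr
  integer :: stat(MPI_STATUS_SIZE)
  real    :: buf(1000)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     buf = 1.0
     ! MPI_SEND here may return as soon as the message is buffered locally;
     ! MPI_SSEND returns only after rank 1 has started the matching receive,
     ! so a missing or mismatched receive shows up as a hang at this call.
     call MPI_SSEND(buf, size(buf), MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_RECV(buf, size(buf), MPI_REAL, 0, 0, MPI_COMM_WORLD, stat, ierr)
  end if

  call MPI_FINALIZE(ierr)
end program ssend_vs_send
~~~~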
Good luck!
Dean
On 7/16/19 6:34 PM, Yi-Hao Chen wrote:
Dear all,
This is an update on this issue. I have been in conversation with TACC staff, and one of their concerns is the use of MPI_SSEND. What would be the motivation for using MPI_SSEND here as opposed to MPI_SEND?
I used DDT and can see that only one process is hanging at the MPI_SSEND while the rest are waiting at MPI_ALLREDUCE in mpi_amr_redist_blk. A screenshot from DDT is attached. I tried printing out the corresponding irecv calls and did see that the matching irecv was executed. Unless there is an additional send with the same (to, from, block#, tag), I cannot see why the send call hangs.
I've searched the FLASH-USERS mailing list and saw that similar problems were brought up a few times[^a][^b][^c], but without a definite solution. The problem seems to be associated with Intel MPI. A possible solution mentioned was to use mvapich2 rather than impi, but when I compiled FLASH with mvapich2, it hung right at reading the checkpoint file.
[^a]: [FLASH-USERS] mpi_amr_redist_blk has some processors hang at nrecv waitall
http://flash.uchicago.edu/pipermail/flash-users/2018-June/002653.html
[^b]: [FLASH-USERS] MPI deadlock in block move after refinement
http://flash.uchicago.edu/pipermail/flash-users/2017-September/002402.html
[^c]: [FLASH-USERS] FLASH crashing: "iteration, no. not moved"
http://flash.uchicago.edu/pipermail/flash-users/2017-March/002219.html
Another thought concerns the use of MPI_IRECV and MPI_ALLREDUCE. The code is structured as
~~~~
MPI_IRECV
while (repeat)
    MPI_SSEND
    MPI_ALLREDUCE
MPI_WAITALL
~~~~
I am wondering whether the MPI_ALLREDUCE call on the receiving process could prevent the previously posted MPI_IRECV from receiving the data; I suspect this might be the reason the MPI_SSEND hangs. However, if that were the case, the problem would probably happen more frequently.
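To check my understanding, I put together a minimal standalone test of the same pattern (my own sketch under simplified assumptions, not the actual PARAMESH code). As far as I can tell from the standard, it should not deadlock, because the matching irecv is already posted before either rank enters the allreduce:
~~~~
program redist_pattern
  use mpi
  implicit none
  integer :: rank, ierr, req, ival, isum
  real    :: buf(8)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  req = MPI_REQUEST_NULL
  if (rank == 1) then
     ! Receiving side: post the nonblocking receive first, as in
     ! mpi_amr_redist_blk, then go on to the collective below.
     call MPI_IRECV(buf, size(buf), MPI_REAL, 0, 42, MPI_COMM_WORLD, req, ierr)
  else if (rank == 0) then
     buf = 0.0
     ! Sending side: the synchronous send can only complete once the
     ! matching receive on rank 1 has been posted.
     call MPI_SSEND(buf, size(buf), MPI_REAL, 1, 42, MPI_COMM_WORLD, ierr)
  end if

  ! All ranks join the collective, analogous to the "no. not moved" count.
  ival = 1
  call MPI_ALLREDUCE(ival, isum, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)

  ! Complete the outstanding receive (a no-op where req is MPI_REQUEST_NULL).
  call MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)

  call MPI_FINALIZE(ierr)
end program redist_pattern
~~~~
If the real code hangs even though the matching irecv was posted, that would seem to point at the MPI implementation rather than at the call pattern itself.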
This part of the code is from Paramesh4dev; relative to Paramesh4.0, send_block_data was separated out of mpi_amr_redist_blk. Although I did not see a big difference, I am not sure whether there are any other significant changes between Paramesh4.0 and Paramesh4dev.
I would appreciate any thought you might have.
Thank you,
Yi-Hao
On Thu, Jul 11, 2019 at 12:30 PM Yi-Hao Chen <ychen at astro.wisc.edu> wrote:
Dear All,
I am having an MPI deadlock right after the restart of a simulation. It happens after initialization, in the evolution stage when refinement occurs. A few of my simulations have run into the same problem, and it seems to be reproducible; however, if I use a different number of MPI tasks, sometimes the run can get past the deadlock.
I am using FLASH4.5 with AMR and USM on Stampede2, with modules intel/18.0.2 and impi/18.0.2.
If you have any suggestions or possible directions to look into, please let me know. Some details are described below.
Thank you,
Yi-Hao
The last few lines in the log file are
==============================================================================
[ 07-02-2019 23:07:04.890 ] [gr_initGeometry] checking BCs for idir: 1
[ 07-02-2019 23:07:04.891 ] [gr_initGeometry] checking BCs for idir: 2
[ 07-02-2019 23:07:04.892 ] [gr_initGeometry] checking BCs for idir: 3
[ 07-02-2019 23:07:04.951 ] memory: /proc vsize (MiB): 2475.21 (min) 2475.73 (max) 2475.21 (avg)
[ 07-02-2019 23:07:04.952 ] memory: /proc rss (MiB): 686.03 (min) 699.24 (max) 690.59 (avg)
[ 07-02-2019 23:07:04.964 ] [io_readData] file opened: type=checkpoint name=Group_L430_hdf5_chk_0148
[ 07-02-2019 23:11:04.268 ] memory: /proc vsize (MiB): 2869.67 (min) 2928.42 (max) 2869.76 (avg)
[ 07-02-2019 23:11:04.303 ] memory: /proc rss (MiB): 1080.69 (min) 1102.95 (max) 1085.31 (avg)
[ 07-02-2019 23:11:04.436 ] [GRID amr_refine_derefine]: initiating refinement
[ 07-02-2019 23:11:04.454 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 177882
[GRID amr_refine_derefine] min blks 230 max blks 235 tot blks 177882
[GRID amr_refine_derefine] min leaf blks 199 max leaf blks 205 tot leaf blks 155647
[ 07-02-2019 23:11:04.730 ] [GRID amr_refine_derefine]: refinement complete
INFO: Grid_fillGuardCells is ignoring masking.
[Hydro_init] MHD: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both turned on!
[ 07-02-2019 23:11:07.111 ] memory: /proc vsize (MiB): 2858.31 (min) 2961.38 (max) 2885.72 (avg)
[ 07-02-2019 23:11:07.112 ] memory: /proc rss (MiB): 1090.02 (min) 1444.63 (max) 1121.47 (avg)
[ 07-02-2019 23:11:08.532 ] [Particles_getGlobalNum]: Number of particles now: 18389431
[ 07-02-2019 23:11:08.535 ] [IO_writePlotfile] open: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000
[ 07-02-2019 23:11:18.449 ] [IO_writePlotfile] close: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000
[ 07-02-2019 23:11:18.450 ] memory: /proc vsize (MiB): 2857.45 (min) 2977.69 (max) 2885.71 (avg)
[ 07-02-2019 23:11:18.453 ] memory: /proc rss (MiB): 1095.74 (min) 1450.62 (max) 1126.72 (avg)
[ 07-02-2019 23:11:18.454 ] [Driver_evolveFlash]: Entering evolution loop
[ 07-02-2019 23:11:18.454 ] step: n=336100 t=7.805991E+14 dt=2.409380E+09
[ 07-02-2019 23:11:23.199 ] [hy_uhd_unsplit]: gcNeed(MAGI_FACE_VAR,MAG_FACE_VAR) - FACES
[ 07-02-2019 23:11:27.853 ] [Particles_getGlobalNum]: Number of particles now: 18389491
[ 07-02-2019 23:11:28.830 ] [GRID amr_refine_derefine]: initiating refinement
[ 07-02-2019 23:11:28.932 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 172522
The last few lines in the output are
MaterialProperties initialized
[io_readData] Opening Group_L430_hdf5_chk_0148 for restart
Progress read 'gsurr_blks' dataset - applying pm4dev optimization.
Source terms initialized
iteration, no. not moved = 0 175264
iteration, no. not moved = 1 8384
iteration, no. not moved = 2 0
refined: total leaf blocks = 155647
refined: total blocks = 177882
INFO: Grid_fillGuardCells is ignoring masking.
Finished with Grid_initDomain, restart
Ready to call Hydro_init
[Hydro_init] NOTE: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both true for MHD!
Hydro initialized
Gravity initialized
Initial dt verified
*** Wrote plotfile to Group_L430_forced_hdf5_plt_cnt_0000 ****
Initial plotfile written
Driver init all done
iteration, no. not moved = 0 165494
slurmstepd: error: *** JOB 3889364 ON c476-064 CANCELLED AT 2019-07-02T23:54:57 ***
I have found that the particular MPI call where it hangs is at line 360 in send_block_data.F90, but I am not sure how to debug the problem further.
359 If (nvar > 0) Then
360 Call MPI_SSEND (unk(1,is_unk,js_unk,ks_unk,lb), &
361 1, &
362 unk_int_type, &
363 new_loc(2,lb), &
364 new_loc(1,lb), &
365 amr_mpi_meshComm, &
366 ierr)
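If I read the standard Fortran MPI_SSEND argument order (buffer, count, datatype, destination, tag, communicator, ierror) onto this call, it sends one element of the derived datatype unk_int_type (describing block lb of unk) to the rank given by new_loc(2,lb), with new_loc(1,lb) used as the message tag on communicator amr_mpi_meshComm.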