[FLASH-USERS] MPI hangs in send_block_data

Wed Jul 17 19:38:52 EDT 2019

Hi Yi-Hao,

I'm not sure if it is helpful, but it is only hinted at in the 
references that you cite, so I figured I'd say it...

I have, in the past, experienced issues like this when pushing
close to the amount of available memory.  It appears to be possible for
paramesh to fail in odd ways when memory is tight, i.e., when there
"should" be enough memory, but with only a modest margin.
Sometimes just running on more processors is enough, but sometimes a
decrease in maxblocks is also required. You may have tried this, but I
figured I would mention it.
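
To put rough numbers on the headroom point, here is a small sketch in
Python (not FLASH or paramesh code). It estimates the size of one
block-indexed array dimensioned like unk, and the average block count per
rank. The variable count, block size, guard-cell count, maxblocks values
and rank counts are all made-up illustrations; only the total of 177882
blocks is taken from the log quoted below. Real memory use includes many
other arrays and MPI buffers, so this is an optimistic estimate at best.

```python
MIB = 1024 * 1024


def unk_bytes(nvar, nxb, nyb, nzb, nguard, maxblocks, bytes_per_real=8):
    """Approximate size in bytes of an array dimensioned
    (nvar, nxb + 2*nguard, nyb + 2*nguard, nzb + 2*nguard, maxblocks)."""
    cells = (nxb + 2 * nguard) * (nyb + 2 * nguard) * (nzb + 2 * nguard)
    return nvar * cells * maxblocks * bytes_per_real


def blocks_per_rank(total_blocks, nranks):
    """Average number of blocks per MPI rank, rounded up."""
    return -(-total_blocks // nranks)


if __name__ == "__main__":
    # All inputs except the block total are hypothetical.
    for maxblocks in (400, 300):
        size = unk_bytes(nvar=20, nxb=8, nyb=8, nzb=8, nguard=4,
                         maxblocks=maxblocks)
        print(f"maxblocks={maxblocks}: {size / MIB:.1f} MiB for this array")
    for nranks in (768, 1024):
        per_rank = blocks_per_rank(177882, nranks)
        print(f"{nranks} ranks: about {per_rank} blocks per rank")
```

The array size scales linearly with maxblocks, while the number of blocks
that actually have to fit scales inversely with the rank count, which is
why either knob (or both) can buy back some margin.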

If the recv is succeeding but the corresponding ssend is still hanging
indefinitely, then it does seem that you have confirmed this is not a bug
in FLASH.  I believe that behavior violates the MPI specification.

As to the question of whether synchronous sends should be used,
I think that is probably a deep discussion about portability and
error reproducibility for paramesh that I'm certainly not qualified to
get into.  From what I can tell from the MPI specs, this is not just about
whether synchronous or non-synchronous is algorithmically necessary, but
also about management of message buffering.  Nominally, use of ssend is a
request for the system not to buffer.  This forestalls (possibly vast)
differences between the buffering behavior of different systems.
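
As an illustration of the buffering point, here is a toy model in Python
(threads and a queue, not real MPI). It assumes a "standard" send that
buffers messages up to some system-dependent size and otherwise waits for
the receiver, and a "synchronous" send that always waits for a matching
receive. The class, the eager_limit parameter and the helper function are
hypothetical names invented for this sketch.

```python
import queue
import threading


class ToyChannel:
    """Toy one-way channel; a stand-in for message passing, not real MPI."""

    def __init__(self, eager_limit):
        # Hypothetical per-system threshold: messages up to this size are buffered.
        self.eager_limit = eager_limit
        self._messages = queue.Queue()

    def ssend(self, payload, timeout=None):
        """Synchronous-style send: True only once a receive has matched it."""
        matched = threading.Event()
        self._messages.put((payload, matched))
        return matched.wait(timeout)

    def send(self, payload, timeout=None):
        """Standard-style send: buffers small messages, else acts like ssend."""
        if len(payload) <= self.eager_limit:
            self._messages.put((payload, None))
            return True
        return self.ssend(payload, timeout)

    def recv(self):
        """Take the next message and release its sender if it is waiting."""
        payload, matched = self._messages.get()
        if matched is not None:
            matched.set()
        return payload


def completes_without_receiver(channel, payload, synchronous):
    """Report whether a send finishes within 0.1 s when nobody is receiving."""
    send = channel.ssend if synchronous else channel.send
    return send(payload, timeout=0.1)
```

In this model a 100-byte standard-style send with no receiver completes
on a "system" with eager_limit=1024 but stalls with eager_limit=10, while
the synchronous-style send stalls on both until recv is called. That is
the sense in which ssend makes the code behave the same way regardless of
how much a given system is willing to buffer.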

Good luck!

Dean

On 7/16/19 6:34 PM, Yi-Hao Chen wrote:
> Dear all,
>
> This is an update on this issue. I have been in conversation with TACC 
> staff and one of the concerns is the use of MPI_SSEND. What would be
> the motivation for using MPI_SSEND here as opposed to MPI_SEND?
>
> I used DDT and can see that only one process is hanging at the 
> MPI_SSEND while the rest are waiting at MPI_ALLREDUCE in 
> mpi_amr_redist_blk. The screenshot from DDT is attached. I tried to 
> print out the corresponding irecv calls and did see the matching irecv 
> was executed. Unless there is an additional send that has the same 
> (to, from, block#, tag), I cannot see the reason that the send call hangs.
>
> I've searched the FLASH-USERS mailing list and saw that similar
> problems have been brought up a few times[^a][^b][^c], but without a
> definite solution. The problem seems to be associated with Intel
> MPI. A possible solution mentioned was to use mvapich2 rather than
> impi. But when I compiled FLASH with mvapich2, it hangs while
> reading the checkpoint file.
>
> [^a]: [FLASH-USERS] mpi_amr_redist_blk has some processors hang at 
> nrecv waitall
> http://flash.uchicago.edu/pipermail/flash-users/2018-June/002653.html
>
> [^b]: [FLASH-USERS] MPI deadlock in block move after refinement
> http://flash.uchicago.edu/pipermail/flash-users/2017-September/002402.html
>
> [^c]: [FLASH-USERS] FLASH crashing: "iteration, no. not moved"
> http://flash.uchicago.edu/pipermail/flash-users/2017-March/002219.html
>
>
> Another thought is regarding the use of MPI_IRECV and MPI_ALLREDUCE. 
> The code is structured as
>
> ~~~~
> MPI_IRECV
> while (repeat)
>     MPI_SSEND
>     MPI_ALLREDUCE
>
> MPI_WAITALL
> ~~~~
>
> I am wondering if the MPI_ALLREDUCE call in the receiving process
> could prevent the previous MPI_IRECV from receiving the data. I suspect
> this might be the reason that the MPI_SSEND hangs. However, if this were
> the case, the problem would probably happen more frequently.
>
> This part of the code is from Paramesh4dev. The send_block_data was 
> separated from mpi_amr_redist_blk in Paramesh4.0. Although I did not 
> see a big difference, I am not sure if there are any significant 
> changes between Paramesh4.0 and Paramesh4dev.
>
> I would appreciate any thoughts you might have.
>
> Thank you,
> Yi-Hao
>
>
> On Thu, Jul 11, 2019 at 12:30 PM Yi-Hao Chen <ychen at astro.wisc.edu>
> wrote:
>
>     Dear All,
>
>     I am having an MPI deadlock happening right after the restart of a
>     simulation. It happens after the initialization, in the
>     evolution stage, when refinement occurs. A few of my simulations ran
>     into the same problem. It seems to be reproducible. However, if
>     I use a different number of MPI tasks, it can sometimes get past
>     the deadlock.
>
>     I am using FLASH4.5 with AMR and USM on stampede2 using modules
>     intel/18.0.2 and impi/18.0.2.
>
>     If you have any suggestions or possible directions to look into,
>     please let me know.  Some details are described below.
>
>     Thank you,
>     Yi-Hao
>
>
>     The last few lines in the log file are
>      ==============================================================================
>      [ 07-02-2019  23:07:04.890 ] [gr_initGeometry] checking BCs for
>     idir: 1
>      [ 07-02-2019  23:07:04.891 ] [gr_initGeometry] checking BCs for
>     idir: 2
>      [ 07-02-2019  23:07:04.892 ] [gr_initGeometry] checking BCs for
>     idir: 3
>      [ 07-02-2019  23:07:04.951 ] memory: /proc vsize    (MiB):    
>     2475.21 (min)       2475.73 (max)     2475.21 (avg)
>      [ 07-02-2019  23:07:04.952 ] memory: /proc rss    (MiB):    
>      686.03 (min)        699.24 (max)      690.59 (avg)
>      [ 07-02-2019  23:07:04.964 ] [io_readData] file opened:
>     type=checkpoint name=Group_L430_hdf5_chk_0148
>      [ 07-02-2019  23:11:04.268 ] memory: /proc vsize    (MiB):    
>     2869.67 (min)       2928.42 (max)     2869.76 (avg)
>      [ 07-02-2019  23:11:04.303 ] memory: /proc rss    (MiB):    
>     1080.69 (min)       1102.95 (max)     1085.31 (avg)
>      [ 07-02-2019  23:11:04.436 ] [GRID amr_refine_derefine]:
>     initiating refinement
>      [ 07-02-2019  23:11:04.454 ] [GRID amr_refine_derefine]: redist.
>     phase.  tot blks requested: 177882
>      [GRID amr_refine_derefine] min blks 230    max blks 235    tot
>     blks 177882
>      [GRID amr_refine_derefine] min leaf blks 199  max leaf blks 205
>     tot leaf blks 155647
>      [ 07-02-2019  23:11:04.730 ] [GRID amr_refine_derefine]:
>     refinement complete
>      INFO: Grid_fillGuardCells is ignoring masking.
>      [Hydro_init] MHD: hy_fullRiemannStateArrays and
>     hy_fullSpecMsFluxHandling are both turned on!
>      [ 07-02-2019  23:11:07.111 ] memory: /proc vsize    (MiB):    
>     2858.31 (min)       2961.38 (max)     2885.72 (avg)
>      [ 07-02-2019  23:11:07.112 ] memory: /proc rss    (MiB):    
>     1090.02 (min)       1444.63 (max)     1121.47 (avg)
>      [ 07-02-2019  23:11:08.532 ] [Particles_getGlobalNum]: Number of
>     particles now: 18389431
>      [ 07-02-2019  23:11:08.535 ] [IO_writePlotfile] open:
>     type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000
>      [ 07-02-2019  23:11:18.449 ] [IO_writePlotfile] close:
>     type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000
>      [ 07-02-2019  23:11:18.450 ] memory: /proc vsize    (MiB):    
>     2857.45 (min)       2977.69 (max)     2885.71 (avg)
>      [ 07-02-2019  23:11:18.453 ] memory: /proc rss    (MiB):    
>     1095.74 (min)       1450.62 (max)     1126.72 (avg)
>      [ 07-02-2019  23:11:18.454 ] [Driver_evolveFlash]: Entering
>     evolution loop
>      [ 07-02-2019  23:11:18.454 ] step: n=336100 t=7.805991E+14
>     dt=2.409380E+09
>      [ 07-02-2019  23:11:23.199 ] [hy_uhd_unsplit]:
>     gcNeed(MAGI_FACE_VAR,MAG_FACE_VAR) - FACES
>      [ 07-02-2019  23:11:27.853 ] [Particles_getGlobalNum]: Number of
>     particles now: 18389491
>      [ 07-02-2019  23:11:28.830 ] [GRID amr_refine_derefine]:
>     initiating refinement
>      [ 07-02-2019  23:11:28.932 ] [GRID amr_refine_derefine]: redist.
>     phase.  tot blks requested: 172522
>
>     The last few lines in the output are
>
>      MaterialProperties initialized
>      [io_readData] Opening Group_L430_hdf5_chk_0148 for restart
>         Progress read 'gsurr_blks' dataset - applying pm4dev optimization.
>      Source terms initialized
>       iteration, no. not moved =            0  175264
>       iteration, no. not moved =            1  8384
>       iteration, no. not moved =            2   0
>      refined: total leaf blocks =       155647
>      refined: total blocks =       177882
>      INFO: Grid_fillGuardCells is ignoring masking.
>       Finished with Grid_initDomain, restart
>      Ready to call Hydro_init
>      [Hydro_init] NOTE: hy_fullRiemannStateArrays and
>     hy_fullSpecMsFluxHandling are
>      both true for MHD!
>      Hydro initialized
>      Gravity initialized
>      Initial dt verified
>      *** Wrote plotfile to Group_L430_forced_hdf5_plt_cnt_0000 ****
>      Initial plotfile written
>      Driver init all done
>       iteration, no. not moved =            0  165494
>     slurmstepd: error: *** JOB 3889364 ON c476-064 CANCELLED AT
>     2019-07-02T23:54:57 ***
>
>     I have found that the MPI call it hangs on is at line
>     360 in send_block_data.F90, but I am not sure how to further debug
>     the problem.
>     359   If (nvar > 0) Then
>     360   Call MPI_SSEND (unk(1,is_unk,js_unk,ks_unk,lb), &
>     361                   1,                              &
>     362                   unk_int_type,                   &
>     363                   new_loc(2,lb),                  &
>     364                   new_loc(1,lb),                  &
>     365                   amr_mpi_meshComm,               &
>     366                   ierr)
>