[FLASH-USERS] MPI hangs in send_block_data

Yi-Hao Chen ychen at astro.wisc.edu
Thu Jul 11 13:30:37 EDT 2019


Dear All,

I am experiencing an MPI deadlock right after restarting a simulation. It happens after initialization, during the evolution stage when refinement occurs. A few of my simulations have run into the same problem, and it seems to be reproducible; however, with a different number of MPI tasks the run sometimes gets past the deadlock.

I am using FLASH4.5 with AMR and USM on Stampede2, with modules intel/18.0.2 and impi/18.0.2.

If you have any suggestions or possible directions to look into, please let me know. Some details are described below.

Thank you,
Yi-Hao


The last few lines in the log file are
 ==============================================================================
 [ 07-02-2019  23:07:04.890 ] [gr_initGeometry] checking BCs for idir: 1
 [ 07-02-2019  23:07:04.891 ] [gr_initGeometry] checking BCs for idir: 2
 [ 07-02-2019  23:07:04.892 ] [gr_initGeometry] checking BCs for idir: 3
 [ 07-02-2019  23:07:04.951 ] memory: /proc vsize    (MiB):     2475.21 (min)       2475.73 (max)       2475.21 (avg)
 [ 07-02-2019  23:07:04.952 ] memory: /proc rss      (MiB):      686.03 (min)        699.24 (max)        690.59 (avg)
 [ 07-02-2019  23:07:04.964 ] [io_readData] file opened: type=checkpoint name=Group_L430_hdf5_chk_0148
 [ 07-02-2019  23:11:04.268 ] memory: /proc vsize    (MiB):     2869.67 (min)       2928.42 (max)       2869.76 (avg)
 [ 07-02-2019  23:11:04.303 ] memory: /proc rss      (MiB):     1080.69 (min)       1102.95 (max)       1085.31 (avg)
 [ 07-02-2019  23:11:04.436 ] [GRID amr_refine_derefine]: initiating refinement
 [ 07-02-2019  23:11:04.454 ] [GRID amr_refine_derefine]: redist. phase.  tot blks requested: 177882
 [GRID amr_refine_derefine] min blks 230    max blks 235    tot blks 177882
 [GRID amr_refine_derefine] min leaf blks 199    max leaf blks 205    tot leaf blks 155647
 [ 07-02-2019  23:11:04.730 ] [GRID amr_refine_derefine]: refinement complete
 INFO: Grid_fillGuardCells is ignoring masking.
 [Hydro_init] MHD: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both turned on!
 [ 07-02-2019  23:11:07.111 ] memory: /proc vsize    (MiB):     2858.31 (min)       2961.38 (max)       2885.72 (avg)
 [ 07-02-2019  23:11:07.112 ] memory: /proc rss      (MiB):     1090.02 (min)       1444.63 (max)       1121.47 (avg)
 [ 07-02-2019  23:11:08.532 ] [Particles_getGlobalNum]: Number of particles now: 18389431
 [ 07-02-2019  23:11:08.535 ] [IO_writePlotfile] open: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000
 [ 07-02-2019  23:11:18.449 ] [IO_writePlotfile] close: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000
 [ 07-02-2019  23:11:18.450 ] memory: /proc vsize    (MiB):     2857.45 (min)       2977.69 (max)       2885.71 (avg)
 [ 07-02-2019  23:11:18.453 ] memory: /proc rss      (MiB):     1095.74 (min)       1450.62 (max)       1126.72 (avg)
 [ 07-02-2019  23:11:18.454 ] [Driver_evolveFlash]: Entering evolution loop
 [ 07-02-2019  23:11:18.454 ] step: n=336100 t=7.805991E+14 dt=2.409380E+09
 [ 07-02-2019  23:11:23.199 ] [hy_uhd_unsplit]: gcNeed(MAGI_FACE_VAR,MAG_FACE_VAR) - FACES
 [ 07-02-2019  23:11:27.853 ] [Particles_getGlobalNum]: Number of particles now: 18389491
 [ 07-02-2019  23:11:28.830 ] [GRID amr_refine_derefine]: initiating refinement
 [ 07-02-2019  23:11:28.932 ] [GRID amr_refine_derefine]: redist. phase.  tot blks requested: 172522

The last few lines in the output are

 MaterialProperties initialized
 [io_readData] Opening Group_L430_hdf5_chk_0148 for restart
    Progress read 'gsurr_blks' dataset - applying pm4dev optimization.
 Source terms initialized
  iteration, no. not moved =            0      175264
  iteration, no. not moved =            1        8384
  iteration, no. not moved =            2           0
 refined: total leaf blocks =       155647
 refined: total blocks =       177882
 INFO: Grid_fillGuardCells is ignoring masking.
  Finished with Grid_initDomain, restart
 Ready to call Hydro_init
 [Hydro_init] NOTE: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both true for MHD!
 Hydro initialized
 Gravity initialized
 Initial dt verified
 *** Wrote plotfile to Group_L430_forced_hdf5_plt_cnt_0000 ****
 Initial plotfile written
 Driver init all done
  iteration, no. not moved =            0      165494
slurmstepd: error: *** JOB 3889364 ON c476-064 CANCELLED AT 2019-07-02T23:54:57 ***

I have found that the particular MPI call it hangs in is the MPI_SSEND at line 360 of send_block_data.F90, but I am not sure how to debug the problem further.
359                 If (nvar > 0) Then
360                 Call MPI_SSEND (unk(1,is_unk,js_unk,ks_unk,lb),        &
361                                 1,                                     &
362                                 unk_int_type,                          &
363                                 new_loc(2,lb),                         &
364                                 new_loc(1,lb),                         &
365                                 amr_mpi_meshComm,                        &
366                                 ierr)
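For context, MPI_SSEND is a synchronous-mode send: it does not complete until the receiving rank has posted a matching receive. So if the send/receive ordering during block redistribution goes wrong on even one rank, every rank blocked in MPI_SSEND waits forever. Below is a minimal sketch of that rendezvous-deadlock pattern — it is only an illustration, not FLASH or PARAMESH code: each "rank" is modeled as a Python thread, and "posting a receive" is modeled as setting an Event the sender waits on.

```python
import threading

# Hypothetical illustration of the MPI_SSEND rendezvous deadlock
# (NOT FLASH/PARAMESH code): a synchronous send blocks until the
# destination has posted a matching receive. If every rank is busy
# sending and none has posted its receive yet, all of them block.

class Rank:
    def __init__(self, name):
        self.name = name
        # set() on this event would model "a matching receive was posted"
        self.recv_posted = threading.Event()

def ssend(dest, timeout):
    """Model of a synchronous send: block until `dest` posts a receive.
    Returns False if it 'hangs' past the timeout (the deadlock case)."""
    return dest.recv_posted.wait(timeout)

a, b = Rank("a"), Rank("b")
results = {}

def worker(me, peer):
    # Each rank tries to Ssend to its peer first and would only post
    # its own receive afterwards -- so neither receive is ever posted.
    results[me.name] = ssend(peer, timeout=0.5)

ta = threading.Thread(target=worker, args=(a, b))
tb = threading.Thread(target=worker, args=(b, a))
ta.start(); tb.start()
ta.join(); tb.join()

print(results)  # both sends time out: the deadlock pattern
```

In real MPI the usual ways out of this pattern are posting the receives (e.g. MPI_IRECV) before the synchronous sends, or using non-blocking sends; which of those applies inside PARAMESH's redistribution, I can't say.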

