<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">
<div>Dear All,</div>
<div><br>
</div>
<div>I am running into an MPI deadlock right after restarting a simulation. It happens after initialization, in the evolution stage, when refinement occurs. A few of my simulations have run into the same problem, and it seems to be reproducible. However, if I use a different number of MPI tasks, the run sometimes gets past the deadlock.</div>
<div><br>
</div>
<div>I am using FLASH4.5 with AMR and the USM MHD solver on Stampede2, with the modules intel/18.0.2 and impi/18.0.2.</div>
<div><br>
</div>
<div>If you have any suggestions or possible directions to look into, please let me know. Some details are described below.<br>
<div>
<div><br>
</div>
<div>Thank you,</div>
<div>Yi-Hao<br>
</div>
<div></div>
<div><br>
</div>
</div>
</div>
<br>
<div>The last few lines in the log file are<br>
</div>
<div><font size="1"></font></div>
<div><font size="1"><span style="font-family:courier new,monospace"> ==============================================================================<br>
[ 07-02-2019 23:07:04.890 ] [gr_initGeometry] checking BCs for idir: 1<br>
[ 07-02-2019 23:07:04.891 ] [gr_initGeometry] checking BCs for idir: 2<br>
[ 07-02-2019 23:07:04.892 ] [gr_initGeometry] checking BCs for idir: 3<br>
[ 07-02-2019 23:07:04.951 ] memory: /proc vsize (MiB): 2475.21 (min) 2475.73 (max) 2475.21 (avg)<br>
[ 07-02-2019 23:07:04.952 ] memory: /proc rss (MiB): 686.03 (min) 699.24 (max) 690.59 (avg)<br>
[ 07-02-2019 23:07:04.964 ] [io_readData] file opened: type=checkpoint name=Group_L430_hdf5_chk_0148<br>
[ 07-02-2019 23:11:04.268 ] memory: /proc vsize (MiB): 2869.67 (min) 2928.42 (max) 2869.76 (avg)<br>
[ 07-02-2019 23:11:04.303 ] memory: /proc rss (MiB): 1080.69 (min) 1102.95 (max) 1085.31 (avg)<br>
[ 07-02-2019 23:11:04.436 ] [GRID amr_refine_derefine]: initiating refinement<br>
[ 07-02-2019 23:11:04.454 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 177882<br>
[GRID amr_refine_derefine] min blks 230 max blks 235 tot blks 177882<br>
[GRID amr_refine_derefine] min leaf blks 199 max leaf blks 205 tot leaf blks 155647<br>
[ 07-02-2019 23:11:04.730 ] [GRID amr_refine_derefine]: refinement complete<br>
INFO: Grid_fillGuardCells is ignoring masking.<br>
[Hydro_init] MHD: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both turned on!<br>
[ 07-02-2019 23:11:07.111 ] memory: /proc vsize (MiB): 2858.31 (min) 2961.38 (max) 2885.72 (avg)<br>
[ 07-02-2019 23:11:07.112 ] memory: /proc rss (MiB): 1090.02 (min) 1444.63 (max) 1121.47 (avg)<br>
[ 07-02-2019 23:11:08.532 ] [Particles_getGlobalNum]: Number of particles now: 18389431<br>
[ 07-02-2019 23:11:08.535 ] [IO_writePlotfile] open: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000<br>
[ 07-02-2019 23:11:18.449 ] [IO_writePlotfile] close: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000<br>
[ 07-02-2019 23:11:18.450 ] memory: /proc vsize (MiB): 2857.45 (min) 2977.69 (max) 2885.71 (avg)<br>
[ 07-02-2019 23:11:18.453 ] memory: /proc rss (MiB): 1095.74 (min) 1450.62 (max) 1126.72 (avg)<br>
[ 07-02-2019 23:11:18.454 ] [Driver_evolveFlash]: Entering evolution loop<br>
[ 07-02-2019 23:11:18.454 ] step: n=336100 t=7.805991E+14 dt=2.409380E+09<br>
[ 07-02-2019 23:11:23.199 ] [hy_uhd_unsplit]: gcNeed(MAGI_FACE_VAR,MAG_FACE_VAR) - FACES<br>
[ 07-02-2019 23:11:27.853 ] [Particles_getGlobalNum]: Number of particles now: 18389491<br>
[ 07-02-2019 23:11:28.830 ] [GRID amr_refine_derefine]: initiating refinement<br>
[ 07-02-2019 23:11:28.932 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 172522<br>
</span></font></div>
<div><br>
</div>
<div>The last few lines in the output are</div>
<div><br>
</div>
<div><font size="1"><span style="font-family:courier new,monospace"> MaterialProperties initialized<br>
[io_readData] Opening Group_L430_hdf5_chk_0148 for restart<br>
Progress read 'gsurr_blks' dataset - applying pm4dev optimization.<br>
Source terms initialized<br>
iteration, no. not moved = 0 175264<br>
iteration, no. not moved = 1 8384<br>
iteration, no. not moved = 2 0<br>
refined: total leaf blocks = 155647<br>
refined: total blocks = 177882<br>
INFO: Grid_fillGuardCells is ignoring masking.<br>
Finished with Grid_initDomain, restart<br>
Ready to call Hydro_init<br>
[Hydro_init] NOTE: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both true for MHD!<br>
Hydro initialized<br>
Gravity initialized<br>
Initial dt verified<br>
*** Wrote plotfile to Group_L430_forced_hdf5_plt_cnt_0000 ****<br>
Initial plotfile written<br>
Driver init all done<br>
iteration, no. not moved = 0 165494</span></font></div>
<div><font size="1"><span style="font-family:courier new,monospace">slurmstepd: error: *** JOB 3889364 ON c476-064 CANCELLED AT 2019-07-02T23:54:57 ***</span></font></div>
<div>
<div><br>
</div>
<div>I have found that the particular MPI call where it hangs is the MPI_SSEND at line 360 in <span style="font-family:courier new,monospace">
send_block_data.F90<span style="font-family:arial,sans-serif">, but I am not sure how to debug the problem further.</span></span><br>
</div>
<div><font size="1"><span style="font-family:courier new,monospace">359    If (nvar > 0) Then<br>
360       Call MPI_SSEND (unk(1,is_unk,js_unk,ks_unk,lb), &<br>
361                       1, &<br>
362                       unk_int_type, &<br>
363                       new_loc(2,lb), &<br>
364                       new_loc(1,lb), &<br>
365                       amr_mpi_meshComm, &<br>
366                       ierr)<br>
</span></font></div>
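<div><br>
</div>
<div>For reference, this is only a rough sketch of a diagnostic I could drop in just before that call, to record which destination rank and message tag each task is blocked on when the hang occurs. The variable names are simply taken from the MPI_SSEND call quoted above, so nothing here is guaranteed to match the rest of send_block_data.F90.</div>
<div><br>
</div>
<div><font size="1"><span style="font-family:courier new,monospace">! sketch only: log the destination rank and tag before the blocking send<br>
write(*,*) 'send_block_data: before MPI_SSEND, block ', lb, &<br>
           ' -> dest rank ', new_loc(2,lb), ', tag ', new_loc(1,lb)<br>
flush(6)   ! Fortran 2003 FLUSH statement, so the line appears even if the job hangs<br>
</span></font></div>
<div>Comparing the last block and destination rank printed by each task against its matching receive might at least show which pair of ranks is stuck.</div>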
<div><br>
</div>
</div>
</div>
</body>
</html>