<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">Hi Dean,
<div><br>
</div>
<div>Thanks for the tips. I did try reducing the maxblocks. </div>
<div><br>
</div>
<div>For reference, I have been running simulations on Stampede2 Skylake nodes with maxblocks set to 800. The actual number of blocks per process has mostly been between 200 and 600. According to top output on the compute nodes, memory usage is usually below 50%: I use 48 MPI tasks per node, each task uses 1-2 GB of RAM, and each Skylake node has 192 GB in total. So it would not make sense to me if this were caused by tight memory. For debugging runs I actually need a larger maxblocks of 1200 in order to use fewer nodes.</div>
<div><br>
</div>
<div>That being said, I did have some cases in which reducing maxblocks helped. I tried maxblocks of 500 and 300, and two of the simulations got through the initial deadlock, but a few others still did not. My current strategy is trial and error: changing the number of nodes or maxblocks and hoping it eventually works.<br>
</div>
<div><br>
</div>
<div>On the other hand, if I understand correctly, a smaller maxblocks makes the mpi_amr_redist_blk routine go through more iterations, because less allocated memory is available to hold the incoming block data, so fewer point-to-point communications are in flight at the same time. Maybe that is what helps if this is a communication problem.</div>
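<div><br>
</div>
<div>Just to make the counting concrete, here is a toy model of what I mean (not the paramesh source; the fixed number of free slots per pass and the numbers themselves are made-up assumptions). With 200 free slots it takes three passes of at most 200 messages each to move 600 blocks, while a larger maxblocks would let them all move in one pass:</div>
<div>~~~~<br>
</div>
<div><span style="font-family:"courier new",monospace">program redist_pass_toy<br>
  implicit none<br>
  ! Toy model, not paramesh: assume each pass can place at most<br>
  ! 'free_slots' incoming blocks on this process, which is roughly<br>
  ! what a smaller maxblocks restricts.<br>
  integer :: to_move, free_slots, pass, moved<br>
  to_move    = 600    ! blocks this process still has to receive<br>
  free_slots = 200    ! free block slots per pass (~ maxblocks - in use)<br>
  pass = 0<br>
  do while (to_move > 0)<br>
     moved   = min(free_slots, to_move)<br>
     to_move = to_move - moved<br>
     print *, 'iteration, no. not moved = ', pass, to_move<br>
     pass = pass + 1<br>
  end do<br>
end program redist_pass_toy<br>
</span></div>
<div>~~~~<br>
</div>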
<div><br>
</div>
<div>Thank you for your insight. I really hope we can track down the root cause of this problem soon.</div>
<div><br>
</div>
<div>Best,<br>
</div>
<div>Yi-Hao<br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jul 17, 2019 at 6:38 PM Dean Townsley <<a href="mailto:Dean.M.Townsley@ua.edu" target="_blank">Dean.M.Townsley@ua.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hi Yi-Hao,</p>
<p>I'm not sure if it is helpful, but it is only hinted at in the references that you cite, so I figured I'd say it...</p>
<p>I have, in the past, experienced issues like this when pushing close to the amount of available memory. It appears to be possible for paramesh to fail in odd ways when memory is tight, i.e. when there "should" be enough memory but only with a modest amount of margin. Sometimes just running on more processors is enough, but sometimes a decrease in maxblocks is also required. You may have tried this already, but I figured I would mention it.</p>
<p>It does seem that if the recv is succeeding but the corresponding ssend just hangs indefinitely, you have confirmed that this is not a bug in FLASH; I believe that behavior violates the MPI standard.</p>
<p>As to the question of whether synchronous sends should be used, I think that is probably a deep discussion about portability and error reproducibility for paramesh that I'm certainly not qualified to get into. From what I can tell from the MPI specs, this is not just about whether a synchronous send is algorithmically necessary, but about management of message buffering. Nominally, using ssend asks the system not to buffer the message, which forestalls (possibly vast) differences between the buffering behavior of different systems.<br>
</p>
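<p>For what it's worth, here is a minimal illustration of that difference in plain MPI (nothing FLASH-specific; run it with 2 ranks). MPI_SEND is allowed to return as soon as the message has been buffered locally, whereas MPI_SSEND does not complete until the matching receive has started on the other rank:</p>
<div>~~~~<br>
</div>
<div><span style="font-family:"courier new",monospace">program send_vs_ssend<br>
  use mpi<br>
  implicit none<br>
  ! Run with 2 ranks.  Swapping MPI_SSEND for MPI_SEND below changes<br>
  ! when rank 0 may proceed: MPI_SEND can return once the message is<br>
  ! buffered, MPI_SSEND only once rank 1's receive has started.<br>
  integer :: ierr, rank, msg(1), stat(MPI_STATUS_SIZE)<br>
  call MPI_INIT(ierr)<br>
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)<br>
  if (rank == 0) then<br>
     msg = 42<br>
     call MPI_SSEND(msg, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)<br>
     print *, 'rank 0: ssend completed, so the receive has started'<br>
  else if (rank == 1) then<br>
     call MPI_RECV(msg, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, stat, ierr)<br>
     print *, 'rank 1: received', msg(1)<br>
  end if<br>
  call MPI_FINALIZE(ierr)<br>
end program send_vs_ssend<br>
</span></div>
<div>~~~~<br>
</div>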
<p>Good luck!</p>
<p>Dean<br>
</p>
<div class="gmail-m_-1831369867506453472gmail-m_-2444529041221919683moz-cite-prefix">
On 7/16/19 6:34 PM, Yi-Hao Chen wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div>Dear all,</div>
<div><br>
</div>
<div>This is an update on this issue. I have been in conversation with TACC staff, and one of their concerns is the use of MPI_SSEND. What would be the motivation for using MPI_SSEND here as opposed to MPI_SEND?</div>
<div><br>
</div>
<div>I used DDT and can see that only one process is hanging at the MPI_SSEND while the rest are waiting at MPI_ALLREDUCE in mpi_amr_redist_blk. The screenshot from DDT is attached. I tried printing out the corresponding irecv calls and did see that the matching irecv was executed. Unless there is an additional send with the same (to, from, block#, tag), I cannot see a reason for the send call to hang.</div>
<div><br>
</div>
<div>
<div>I've searched the FLASH-USERS mailing list and saw that similar problems were brought up a few times[^a][^b][^c], but without a definite solution. The problem seems to be associated with the Intel MPI. One possible solution mentioned was to use mvapich2 rather than impi, but when I compiled FLASH with mvapich2, it hung right at reading the checkpoint file.
<br>
</div>
<div><br>
</div>
<div>[^a]: [FLASH-USERS] mpi_amr_redist_blk has some processors hang at nrecv waitall</div>
<div><a href="http://flash.uchicago.edu/pipermail/flash-users/2018-June/002653.html" target="_blank">http://flash.uchicago.edu/pipermail/flash-users/2018-June/002653.html</a></div>
<div>
<div dir="ltr"><br>
</div>
<div dir="ltr">[^b]: [FLASH-USERS] MPI deadlock in block move after refinement
<div><a href="http://flash.uchicago.edu/pipermail/flash-users/2017-September/002402.html" target="_blank">http://flash.uchicago.edu/pipermail/flash-users/2017-September/002402.html</a></div>
<div><br>
</div>
</div>
</div>
[^c]: [FLASH-USERS] FLASH crashing: "iteration, no. not moved"
<div><a href="http://flash.uchicago.edu/pipermail/flash-users/2017-March/002219.html" target="_blank">http://flash.uchicago.edu/pipermail/flash-users/2017-March/002219.html</a></div>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div>
<div>Another thought concerns the use of MPI_IRECV and MPI_ALLREDUCE. The code is structured as
<br>
</div>
<div><br>
</div>
<div>~~~~<br>
</div>
<div><span style="font-family:"courier new",monospace">MPI_IRECV</span></div>
<div><span style="font-family:"courier new",monospace"></span></div>
<div><span style="font-family:"courier new",monospace">while (repeat)<br>
</span></div>
<div><span style="font-family:"courier new",monospace"> MPI_SSEND</span></div>
<div><span style="font-family:"courier new",monospace"></span></div>
<div><span style="font-family:"courier new",monospace"> MPI_ALLREDUCE</span></div>
<div><span style="font-family:"courier new",monospace"><br>
</span></div>
<div><span style="font-family:"courier new",monospace">MPI_WAITALL</span></div>
<div>~~~~<br>
</div>
<div><br>
</div>
<div>I am wondering if the MPI_ALLREDUCE call in the receiving process could prevent the previously posted MPI_IRECV from receiving the data. I suspect this might be the reason the MPI_SSEND hangs. However, if that were the case, the problem should probably happen much more frequently.</div>
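<div><br>
</div>
<div>A minimal stand-alone probe of that pattern would look something like the following (this is my reading of the pseudocode above, not the paramesh code itself). Rank 1 posts the MPI_IRECV and then sits in MPI_ALLREDUCE while rank 0 is in MPI_SSEND; if this pattern alone were enough to deadlock, the probe should hang with 2 ranks, whereas on a conforming MPI it should finish because the receive has already been posted:</div>
<div>~~~~<br>
</div>
<div><span style="font-family:"courier new",monospace">program irecv_allreduce_probe<br>
  use mpi<br>
  implicit none<br>
  ! Run with exactly 2 ranks.  The message is large enough that it is<br>
  ! unlikely to be buffered eagerly.<br>
  integer :: ierr, rank, req, stat(MPI_STATUS_SIZE)<br>
  integer :: nmoved, ntot<br>
  double precision :: blk(100000)<br>
  call MPI_INIT(ierr)<br>
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)<br>
  nmoved = 1<br>
  if (rank == 0) then<br>
     blk = 1.0d0<br>
     call MPI_SSEND(blk, size(blk), MPI_DOUBLE_PRECISION, 1, 7, MPI_COMM_WORLD, ierr)<br>
     call MPI_ALLREDUCE(nmoved, ntot, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)<br>
     print *, 'rank 0: ssend and allreduce completed'<br>
  else if (rank == 1) then<br>
     blk = 0.0d0<br>
     call MPI_IRECV(blk, size(blk), MPI_DOUBLE_PRECISION, 0, 7, MPI_COMM_WORLD, req, ierr)<br>
     call MPI_ALLREDUCE(nmoved, ntot, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)<br>
     call MPI_WAIT(req, stat, ierr)<br>
     print *, 'rank 1: irecv completed after allreduce'<br>
  end if<br>
  call MPI_FINALIZE(ierr)<br>
end program irecv_allreduce_probe<br>
</span></div>
<div>~~~~<br>
</div>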
<div><br>
</div>
<div>This part of the code is from Paramesh4dev. The send_block_data routine was separated out from mpi_amr_redist_blk in Paramesh4.0. Although I did not see a big difference, I am not sure whether there are any significant changes between Paramesh4.0 and Paramesh4dev.
<br>
</div>
<div><br>
</div>
<div>I would appreciate any thoughts you might have.<br>
</div>
<div><br>
</div>
<div>Thank you,</div>
<div>Yi-Hao<br>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jul 11, 2019 at 12:30 PM Yi-Hao Chen <<a href="mailto:ychen@astro.wisc.edu" target="_blank">ychen@astro.wisc.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div>Dear All,</div>
<div><br>
</div>
<div>I am having an MPI deadlock right after the restart of a simulation. It happens after initialization, in the evolution stage, when refinement occurs. A few of my simulations have run into the same problem, and it seems to be reproducible. However, if I use a different number of MPI tasks, it can sometimes get past the deadlock.</div>
<div><br>
</div>
<div>I am using FLASH4.5 with AMR and USM on Stampede2, using the modules intel/18.0.2 and impi/18.0.2.</div>
<div><br>
</div>
<div>If you have any suggestions or possible directions to look into, please let me know. Some details are described below.<br>
<div>
<div><br>
</div>
<div>Thank you,</div>
<div>Yi-Hao<br>
</div>
<div><br>
</div>
</div>
</div>
<br>
<div>The last few lines in the log file are<br>
</div>
<div><font size="1"><span style="font-family:"courier new",monospace"> ==============================================================================<br>
[ 07-02-2019 23:07:04.890 ] [gr_initGeometry] checking BCs for idir: 1<br>
[ 07-02-2019 23:07:04.891 ] [gr_initGeometry] checking BCs for idir: 2<br>
[ 07-02-2019 23:07:04.892 ] [gr_initGeometry] checking BCs for idir: 3<br>
[ 07-02-2019 23:07:04.951 ] memory: /proc vsize (MiB): 2475.21 (min) 2475.73 (max) 2475.21 (avg)<br>
[ 07-02-2019 23:07:04.952 ] memory: /proc rss (MiB): 686.03 (min) 699.24 (max) 690.59 (avg)<br>
[ 07-02-2019 23:07:04.964 ] [io_readData] file opened: type=checkpoint name=Group_L430_hdf5_chk_0148<br>
[ 07-02-2019 23:11:04.268 ] memory: /proc vsize (MiB): 2869.67 (min) 2928.42 (max) 2869.76 (avg)<br>
[ 07-02-2019 23:11:04.303 ] memory: /proc rss (MiB): 1080.69 (min) 1102.95 (max) 1085.31 (avg)<br>
[ 07-02-2019 23:11:04.436 ] [GRID amr_refine_derefine]: initiating refinement<br>
[ 07-02-2019 23:11:04.454 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 177882<br>
[GRID amr_refine_derefine] min blks 230 max blks 235 tot blks 177882<br>
[GRID amr_refine_derefine] min leaf blks 199 max leaf blks 205 tot leaf blks 155647<br>
[ 07-02-2019 23:11:04.730 ] [GRID amr_refine_derefine]: refinement complete<br>
INFO: Grid_fillGuardCells is ignoring masking.<br>
[Hydro_init] MHD: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both turned on!<br>
[ 07-02-2019 23:11:07.111 ] memory: /proc vsize (MiB): 2858.31 (min) 2961.38 (max) 2885.72 (avg)<br>
[ 07-02-2019 23:11:07.112 ] memory: /proc rss (MiB): 1090.02 (min) 1444.63 (max) 1121.47 (avg)<br>
[ 07-02-2019 23:11:08.532 ] [Particles_getGlobalNum]: Number of particles now: 18389431<br>
[ 07-02-2019 23:11:08.535 ] [IO_writePlotfile] open: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000<br>
[ 07-02-2019 23:11:18.449 ] [IO_writePlotfile] close: type=plotfile name=Group_L430_forced_hdf5_plt_cnt_0000<br>
[ 07-02-2019 23:11:18.450 ] memory: /proc vsize (MiB): 2857.45 (min) 2977.69 (max) 2885.71 (avg)<br>
[ 07-02-2019 23:11:18.453 ] memory: /proc rss (MiB): 1095.74 (min) 1450.62 (max) 1126.72 (avg)<br>
[ 07-02-2019 23:11:18.454 ] [Driver_evolveFlash]: Entering evolution loop<br>
[ 07-02-2019 23:11:18.454 ] step: n=336100 t=7.805991E+14 dt=2.409380E+09<br>
[ 07-02-2019 23:11:23.199 ] [hy_uhd_unsplit]: gcNeed(MAGI_FACE_VAR,MAG_FACE_VAR) - FACES<br>
[ 07-02-2019 23:11:27.853 ] [Particles_getGlobalNum]: Number of particles now: 18389491<br>
[ 07-02-2019 23:11:28.830 ] [GRID amr_refine_derefine]: initiating refinement<br>
[ 07-02-2019 23:11:28.932 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 172522<br>
</span></font></div>
<div><br>
</div>
<div>The last few lines in the output are</div>
<div><br>
</div>
<div><font size="1"><span style="font-family:"courier new",monospace"> MaterialProperties initialized<br>
[io_readData] Opening Group_L430_hdf5_chk_0148 for restart<br>
Progress read 'gsurr_blks' dataset - applying pm4dev optimization.<br>
Source terms initialized<br>
iteration, no. not moved = 0 175264<br>
iteration, no. not moved = 1 8384<br>
iteration, no. not moved = 2 0<br>
refined: total leaf blocks = 155647<br>
refined: total blocks = 177882<br>
INFO: Grid_fillGuardCells is ignoring masking.<br>
Finished with Grid_initDomain, restart<br>
Ready to call Hydro_init<br>
[Hydro_init] NOTE: hy_fullRiemannStateArrays and hy_fullSpecMsFluxHandling are both true for MHD!<br>
Hydro initialized<br>
Gravity initialized<br>
Initial dt verified<br>
*** Wrote plotfile to Group_L430_forced_hdf5_plt_cnt_0000 ****<br>
Initial plotfile written<br>
Driver init all done<br>
iteration, no. not moved = 0 165494</span></font></div>
<div><font size="1"><span style="font-family:"courier new",monospace">slurmstepd: error: *** JOB 3889364 ON c476-064 CANCELLED AT 2019-07-02T23:54:57 ***</span></font></div>
<div>
<div><br>
</div>
<div>I have found that the particular MPI call where it hangs is at line 360 in <span style="font-family:"courier new",monospace">send_block_data.F90</span>, but I am not sure how to debug the problem further.<br>
</div>
<div>359 If (nvar > 0) Then<br>
360 Call MPI_SSEND (unk(1,is_unk,js_unk,ks_unk,lb), & <br>
361 1, & <br>
362 unk_int_type, & <br>
363 new_loc(2,lb), & <br>
364 new_loc(1,lb), & <br>
365 amr_mpi_meshComm, & <br>
366 ierr)<br>
</div>
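<div><br>
</div>
<div>One way I can imagine narrowing this down (just a sketch of the idea, not something I have tried) is to replace the blocking send with MPI_ISSEND plus a polling loop that reports a send that stays unmatched, instead of hanging silently. In send_block_data.F90 that would mean swapping the MPI_SSEND above for MPI_ISSEND and printing lb, new_loc(2,lb), and new_loc(1,lb) whenever the request stays incomplete for too long. The stand-alone demo below shows the mechanism with a deliberately late receiver:</div>
<div>~~~~<br>
</div>
<div><span style="font-family:"courier new",monospace">program stuck_ssend_probe<br>
  use mpi<br>
  implicit none<br>
  ! Run with 2 ranks.  Rank 1 delays its receive by ~5 s, so rank 0's<br>
  ! polling loop reports the unmatched ssend before it completes.<br>
  integer :: ierr, rank, req, stat(MPI_STATUS_SIZE), msg(1)<br>
  logical :: done<br>
  double precision :: t0<br>
  call MPI_INIT(ierr)<br>
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)<br>
  if (rank == 0) then<br>
     msg = 1<br>
     call MPI_ISSEND(msg, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, req, ierr)<br>
     done = .false.<br>
     t0 = MPI_WTIME()<br>
     do while (.not. done)<br>
        call MPI_TEST(req, done, stat, ierr)<br>
        if (.not. done .and. MPI_WTIME() - t0 > 3.0d0) then<br>
           print *, 'rank 0: ssend still unmatched after 3 s'<br>
           t0 = MPI_WTIME()<br>
        end if<br>
     end do<br>
     print *, 'rank 0: send completed'<br>
  else if (rank == 1) then<br>
     t0 = MPI_WTIME()<br>
     do while (MPI_WTIME() - t0 < 5.0d0)   ! simulate a late receiver<br>
     end do<br>
     call MPI_RECV(msg, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, stat, ierr)<br>
  end if<br>
  call MPI_FINALIZE(ierr)<br>
end program stuck_ssend_probe<br>
</span></div>
<div>~~~~<br>
</div>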
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</body>
</html>