<div dir="ltr"><div>Hi Klaus and Shimon,<br><br><div>Commenting out PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST works, happily. The simulation is able to proceed from checkpoint with 20 procs and Multigrid configured+enabled (no dependence on maxblocks, NGUARD); the plt outputs look reasonable.<br></div></div><br><div>It looks like a deeper fix would require some study of the block tree, as Shimon noted separately. I've copied Shimon's message + some other in-line responses below. Thank you both for the many suggestions.<br></div><div><br></div><div>Best,</div><div>Aaron<br></div><div><br></div><div><br></div><div>---------- Forwarded message ---------<br>From: simondonxq <<a href="mailto:simondonxq@gmail.com">simondonxq@gmail.com</a>><br>Date: Tue, May 21, 2019 at 10:43 PM<br>Subject: Re: [FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling<br>To: Aaron Tran <<a href="mailto:aaron.tran@columbia.edu">aaron.tran@columbia.edu</a>><br>Hi Aaron,<br>
The segfault occurs during a guardcell filling operation. To do a
guardcell filling, amr_guardcell calls amr_1blk_guardcell_srl to get
data from surrounding blocks of the same level.<br>If a
neighbouring block is on a remote processor, then prior to amr_guardcell,
mpi_amr_comm_setup must be called to pack the remote block data
into temprecv_buf.<br>Then, when amr_1blk_guardcell_srl calls
amr_1blk_cc_cp_remote, this subroutine tries to find the data for the
remote block inside temprecv_buf.<br>So, to debug the segfault,
you need to carefully analyze the block tree to find on which block,
and on which of its neighbours, the indexing of temprecv_buf by
amr_1blk_cc_cp_remote leads to the segfault.<br>________________________________<br>Shimon</div><div>---------- End forwarded message ---------</div><div><br></div>
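<div>For anyone tracing this later: a diagnostic guard placed in amr_1blk_cc_cp_remote just before the copy from temprecv_buf would report the failing block/neighbour pair directly. A minimal sketch (indx, ivar, and temprecv_buf are names from the routine itself; the mype/remote_pe/remote_block argument names are from our PARAMESH tree and may differ elsewhere):<br></div><div><pre>! sketch: bounds guard before the copy loop in amr_1blk_cc_cp_remote
if (indx + ivar > size(temprecv_buf)) then
   write(*,*) 'temprecv_buf overrun: mype=', mype, &
        ' remote_pe=', remote_pe, ' remote_block=', remote_block, &
        ' indx+ivar=', indx+ivar, ' size=', size(temprecv_buf)
   stop 'temprecv_buf out of bounds'
end if</pre></div><div><br></div><div dir="ltr"><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> * +pm4dev_fixed - probably not necessary any more since just +pm4dev<br>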
does the same thing these days. This may be a leftover from several<br>
versions back, when the default configuration of the PARAMESH Grid<br>
was different.<br></blockquote><div><br></div><div>Got it.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* -maxblocks=300 with blocks of size 16x16x16 (+guard cells) may be<br>
quite a lot, are you sure you are not running into memory problems?<br></blockquote><div><br></div><div>No memory problems, though the run is near the memory limit, as determined by earlier trial-and-error. Tests with maxblocks=20 and 100 still reproduce the segfault.</div><div><br></div>
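<div>As a rough back-of-envelope check on why it sits near the limit (assuming ~25 cell-centered unk variables and 8-byte reals; the actual variable count depends on the setup):<br></div><div><pre>cells/block  = (16 + 2*6)^3         = 21952    (with NGUARD = 6)
bytes/block  = 21952 * 25 * 8 B     ~ 4.4 MB   (unk only)
per process  = 300 blocks * 4.4 MB  ~ 1.3 GB   (plus face-centered and scratch arrays)</pre></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">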
* +supportPPMupwind is unusual, in that it requires 6 layers of guard<br>
cells, rather than the usual default of 4.<br>
- Is this increased number represented in Flash.h (as NGUARD); are<br>
there any places that somehow still assume or imply that nguard=4?<br>
(maybe runtime parameters?)<br></blockquote><div><br></div><div>Yes, Flash.h shows #define NGUARD 6. Most paramesh variables are default (except: nblockx,y,z, refinement settings, gr_sanitize{...}). In any case, a run with NGUARD=4 reproduces the error.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Some things you may want to try; they *could* help, but I don't<br>
think it is very likely:<br>
* In the file<br>
source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config<br>
comment out the line<br>
PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST</blockquote><div><br></div><div>This change, by itself, appears to fix (side-step?) the problem.<br></div><div><br></div>
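<div>For reference, the edit is a one-line comment-out in that Config file (using the Config files' # comment character), followed by a re-run of ./setup:<br></div><div><pre># source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config
# disabled to work around the temprecv_buf overrun:
#PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST</pre></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">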
* Make sure you have runtime parameter enableMaskedGCFill=FALSE<br></blockquote><div><br></div><div>Confirmed.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Try a different size (probably larger factor than the default of 10)<br>
for maxblocks_tr and maxblocks_alloc; this may need changes in<br>
both paramesh_dimensions.F90 and amr_initialize.F90.<br></blockquote><div><br></div><div>No difference: I tried setting both to 10x their defaults (i.e., a factor of 100), with both maxblocks=20 and 200.<br></div><div><br></div>
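<div>Concretely, the change amounted to raising the literal factor of 10 in both files; a sketch (the exact declarations vary between PARAMESH versions, so treat the surrounding context as illustrative):<br></div><div><pre>! paramesh_dimensions.F90 -- tree-array sizing (was 10*maxblocks)
integer, parameter :: maxblocks_tr = 100*maxblocks

! amr_initialize.F90 -- allocation sizing (was maxblocks * 10)
maxblocks_alloc = maxblocks * 100</pre></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">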
Try not using +supportPPMupwind, make sure NGUARD is 4, see whether<br>
problem disappears.<br></blockquote><div><br></div><div>The segfault persists. I confirmed that (1) Flash.h has NGUARD=4, (2) the core dump shows amr_1blk_cc_cp_remote(...) called with id,jd,kd; is,js,ks; and ilays,jlays,klays appropriate for a 4x4x16 guard-cell region, and (3) the cause of the segfault, temprecv_buf(indx+ivar) going out of bounds, remains the same. The problematic region is the same as before (same remote and destination proc+block).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> Lastly, if Multigrid is omitted:<br>
> <br>
> ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \<br>
> -maxblocks=300 +cube16<br>
> <br>
> the segfault disappears and the simulation proceeds successfully, for at<br>
> least a step or two.<br>
<br>
Is the same true if you leave Multigrid configured in, but only disable<br>
it at runtime? (I think useGravity=.FALSE. and/or updateGravity=.FALSE.<br>
do this.)<br></blockquote><div><br></div><div>The segfault persists with both useGravity=.false. and updateGravity=.false.<br></div></div></div><div>
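<div>For completeness, the runtime parameters exercised across these tests, in flash.par form (values as tested; everything else unchanged):</div><div><pre># flash.par excerpt -- settings used while chasing the segfault
# masked guardcell fill off, as suggested:
enableMaskedGCFill = .FALSE.
# Multigrid configured in but disabled at runtime (segfault persisted):
useGravity     = .FALSE.
updateGravity  = .FALSE.</pre></div>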
</div></div>