[FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling
Aaron Tran
aaron.tran at columbia.edu
Thu May 23 17:23:45 EDT 2019
Hi Klaus and Shimon,
Commenting out PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST works, happily.
The simulation is able to proceed from checkpoint with 20 procs and
Multigrid configured+enabled (no dependence on maxblocks, NGUARD); the plt
outputs look reasonable.
It looks like a deeper fix would require some study of the block tree, as
Shimon noted separately. I've copied Shimon's message + some other in-line
responses below. Thank you both for the many suggestions.
Best,
Aaron
---------- Forwarded message ---------
From: simondonxq <simondonxq at gmail.com>
Date: Tue, May 21, 2019 at 10:43 PM
Subject: Re: [FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling
To: Aaron Tran <aaron.tran at columbia.edu>
Hi Aaron,
The segfault occurs during a guardcell filling operation. To do a
guardcell filling, amr_guardcell calls amr_1blk_guardcell_srl to get
data from surrounding blocks at the same level.
If a neighbouring block is on a remote processor, then
mpi_amr_comm_setup is called prior to amr_guardcell to place the remote
block data into temprecv_buf.
Then, when amr_1blk_guardcell_srl calls amr_1blk_cc_cp_remote, that
subroutine tries to find the data for the remote block inside
temprecv_buf.
So, to debug the segfault, you need to carefully analyse the block
tree to find for which block, and for which of its neighbours, the
lookup into temprecv_buf by amr_1blk_cc_cp_remote leads to the segfault.
________________________________
Shimon
---------- End forwarded message ---------
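As a rough illustration of the search Shimon describes, here is a minimal
Python sketch. It is not PARAMESH code: RemoteFetch, its fields, and
find_out_of_bounds are hypothetical names, and the flat-buffer layout
(offset plus ncells*nvar values per neighbour) is an assumption made only
for the example. The idea is to check, for each (block, neighbour) pair,
whether the index range to be read stays inside the receive buffer, and to
report the pairs that do not.

```python
from dataclasses import dataclass


@dataclass
class RemoteFetch:
    # All fields are hypothetical stand-ins, not PARAMESH variables.
    block: int      # local destination block
    neighbor: int   # remote neighbour block supplying the data
    offset: int     # 0-based start of the neighbour's data in the buffer
    ncells: int     # number of cells to copy
    nvar: int       # number of variables stored per cell


def find_out_of_bounds(fetches, buf_len):
    """Return the fetches whose index range falls outside [0, buf_len)."""
    bad = []
    for f in fetches:
        end = f.offset + f.ncells * f.nvar  # one past the last index read
        if f.offset < 0 or end > buf_len:
            bad.append(f)
    return bad


fetches = [
    RemoteFetch(block=1, neighbor=7, offset=0, ncells=4, nvar=2),
    RemoteFetch(block=2, neighbor=9, offset=8, ncells=4, nvar=2),
    RemoteFetch(block=3, neighbor=11, offset=12, ncells=4, nvar=2),
]
for f in find_out_of_bounds(fetches, buf_len=16):
    print("out of bounds: block", f.block, "neighbour", f.neighbor)
```

In the real code the equivalent check would have to be placed around the
temprecv_buf access itself, or obtained by compiling with the compiler's
run-time bounds checking.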
> * +pm4dev_fixed - probably not necessary any more since just +pm4dev
> does the same thing these days. This may be a leftover from several
> versions back, when the default configuration of the PARAMESH Grid
> was different.
>
Got it.
> * -maxblocks=300 with blocks of size 16x16x16 (+guard cells) may be
> quite a lot, are you sure you are not running into memory problems?
>
No memory problems, though it is near the memory limit, as determined by
prior trial-and-error. Tests with maxblocks=20 and 100 still reproduce the
segfault.
> * +supportPPMupwind is unusual, in that it requires 6 layers of guard
> cells, rather than the usual default of 4.
> - Is this increased number represented in Flash.h (as NGUARD); are
> there any places that somehow still assume or imply that nguard=4?
> (maybe runtime parameters?)
>
Yes, Flash.h shows #define NGUARD 6. Most paramesh variables are default
(except: nblockx,y,z, refinement settings, gr_sanitize{...}). In any case,
a run with NGUARD=4 reproduces the error.
> Some things you may want to try; they *could* help but I don't
> think it is very likely:
> * In the file
> source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config
> comment out the line
> PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST
>
This change, by itself, appears to fix (side-step?) the problem.
> * Make sure you have runtime parameter enableMaskedGCFill=FALSE
>
Confirmed.
> * Try a different size (probably larger factor than the default of 10)
> for maxblocks_tr and maxblocks_alloc; this may need changes in
> both paramesh_dimensions.F90 and amr_initialize.F90.
>
No difference: tried setting both 10x larger than default, with both
maxblocks=20 and 200.
> Try not using +supportPPMupwind, make sure NGUARD is 4, see whether
> problem disappears.
>
The segfault persists. I confirmed that (1) Flash.h has NGUARD=4, (2) core
dump shows amr_1blk_cc_cp_remote(...) is called with id,jd,kd and is,js,ks,
ilays,jlays,klays appropriate for a 4x4x16 guard cell region, and (3) the
cause of segfault, temprecv_buf(indx+ivar) going out-of-bounds, remains the
same. The problematic region is the same as before (same remote and
destination proc+block).
> Lastly, if Multigrid is omitted:
> >
> > ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \
> > -maxblocks=300 +cube16
> >
> > the segfault disappears and the simulation proceeds successfully, for at
> > least a step or two.
>
> Is the same true if you leave Multigrid configured in, but only disable
> it at runtime? (I think useGravity=.FALSE. and/or updateGravity=.FALSE.
> do this.)
>
The segfault persists with both useGravity=.false. and updateGravity=.false.