[FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling

Aaron Tran aaron.tran at columbia.edu
Thu May 23 17:23:45 EDT 2019


Hi Klaus and Shimon,

Commenting out PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST works, happily.
The simulation is able to proceed from checkpoint with 20 procs and
Multigrid configured+enabled (no dependence on maxblocks, NGUARD); the plt
outputs look reasonable.

It looks like a deeper fix would require some study of the block tree, as
Shimon noted separately.  I've copied Shimon's message + some other in-line
responses below.  Thank you both for the many suggestions.

Best,
Aaron


---------- Forwarded message ---------
From: simondonxq <simondonxq at gmail.com>
Date: Tue, May 21, 2019 at 10:43 PM
Subject: Re: [FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling
To: Aaron Tran <aaron.tran at columbia.edu>
Hi Aaron,
     The segfault occurs during a guardcell filling operation. To fill
guard cells, amr_guardcell calls amr_1blk_guardcell_srl to get data from
surrounding blocks at the same refinement level.
If a neighbouring block lives on a remote processor, then mpi_amr_comm_setup
must be called before amr_guardcell to stage the remote block data in
temprecv_buf.
When amr_1blk_guardcell_srl then calls amr_1blk_cc_cp_remote, that
subroutine looks up the data for the remote block inside temprecv_buf.
     So, to debug the segfault, you need to carefully analyze the block
tree to find for which block, and for which of its neighbours, the
temprecv_buf access in amr_1blk_cc_cp_remote leads to a segfault.
________________________________
Shimon
---------- End forwarded message ---------
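
Shimon's note pinpoints the failing read.  For reference, the kind of
diagnostic guard one could drop into amr_1blk_cc_cp_remote just before the
temprecv_buf access, to report the offending block instead of segfaulting,
might look like the following sketch (indx, ivar, and temprecv_buf are from
the actual code path; remote_pe and remote_block are my guesses at the
local names identifying the neighbour):

    ! hypothetical guard before the failing read in amr_1blk_cc_cp_remote
    if (indx + ivar > size(temprecv_buf)) then
       write(*,*) 'temprecv_buf overrun: index ', indx + ivar, &
                  ' exceeds size ', size(temprecv_buf)
       write(*,*) 'remote_pe ', remote_pe, ', remote_block ', remote_block
       stop 1
    end if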


>  * +pm4dev_fixed - probably not necessary any more since just +pm4dev
>    does the same thing these days. This may be a leftover from several
>    versions back, when the default configuration of the PARAMESH Grid
>    was different.
>

Got it.


>  * -maxblocks=300 with blocks of size 16x16x16 (+guard cells) may be
>    quite a lot, are you sure you are not running into memory problems?
>

No memory problems, though the run is near the memory limit (as determined
by prior trial-and-error).  Tests with maxblocks=20 and 100 still reproduce
the segfault.
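
(For scale: with NGUARD=6, each 16^3 block actually stores
(16+2*6)^3 = 21952 cells per variable.  A rough per-rank estimate of the
cell-centered storage, using a placeholder count of 20 unknowns rather than
the actual number from this setup:)

    program block_memory_estimate
      implicit none
      integer, parameter :: nxb = 16, nguard = 6, maxblocks = 300
      integer, parameter :: nunk = 20   ! placeholder unknown count
      real :: cells, gib
      cells = real((nxb + 2*nguard)**3)          ! 28**3 = 21952 cells
      gib = cells * nunk * 8.0 * maxblocks / 2.0**30
      print '(a, f6.2, a)', 'unk storage ~', gib, ' GiB per rank'
    end program block_memory_estimate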

>  * +supportPPMupwind is unusual, in that it requires 6 layers of guard
>    cells, rather than the usual default of 4.
>    - Is this increased number represented in Flash.h (as NGUARD); are
>      there any places that somehow still assume or imply that nguard=4?
>      (maybe runtime parameters?)
>

Yes, Flash.h shows #define NGUARD 6.  Most paramesh variables are default
(except: nblockx,y,z, refinement settings, gr_sanitize{...}).  In any case,
a run with NGUARD=4 reproduces the error.
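
A quick way to verify, assuming the default object/ output directory from
setup:

    $ grep 'define NGUARD' object/Flash.h
    #define NGUARD 6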

> Some things you may want to try; they *could* help, but I don't
> think it is very likely:
>  * In the file
>    source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config
>    comment out the line
>     PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST
>

This change, by itself, appears to fix (side-step?) the problem.
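
For anyone finding this thread later, the edit is a one-line change in
source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config
(assuming '#' as the Config comment character):

    # PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST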

>  * Make sure you have runtime parameter enableMaskedGCFill=FALSE
>

Confirmed.


>  * Try a different size (probably larger factor than the default of 10)
>    for maxblocks_tr and maxblocks_alloc; this may need changes in
>    both paramesh_dimensions.F90 and amr_initialize.F90.
>

No difference: I tried setting both to 10x their defaults, with both
maxblocks=20 and 200.
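
The 10x test looked roughly like this; the exact lines differ between
versions, so these are approximate forms of the defaults, not verbatim
source:

    ! paramesh_dimensions.F90 (default factor is 10)
    integer :: maxblocks_tr = 100*maxblocks      ! was 10*maxblocks

    ! amr_initialize.F90
    maxblocks_alloc = maxblocks * 100            ! was maxblocks * 10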

> Try not using +supportPPMupwind, make sure NGUARD is 4, and see whether
> the problem disappears.
>

The segfault persists.  I confirmed that (1) Flash.h has NGUARD=4, (2) the
core dump shows amr_1blk_cc_cp_remote(...) is called with id,jd,kd and
is,js,ks, ilays,jlays,klays appropriate for a 4x4x16 guard cell region, and
(3) the cause of the segfault, temprecv_buf(indx+ivar) going out-of-bounds,
remains the same.  The problematic region is the same as before (same
remote and destination proc+block).
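
For anyone reproducing this, here is a standalone sketch of the failure
mode; building it with runtime bounds checking (gfortran -fcheck=bounds, or
ifort -check bounds) reports the array name and index at the failing line
instead of segfaulting, which is also the easiest way to surface the
temprecv_buf overrun in the real run:

    ! bounds_demo.f90 -- an index computed past the end of a buffer
    program bounds_demo
      implicit none
      real, allocatable :: temprecv_buf(:)
      integer :: indx, ivar
      allocate(temprecv_buf(1000))
      temprecv_buf = 0.0
      indx = 998
      do ivar = 1, 4
         print *, temprecv_buf(indx + ivar)   ! ivar = 3, 4 go out of bounds
      end do
    end program bounds_demo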

> > Lastly, if Multigrid is omitted:
> >
> >     ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \
> >         -maxblocks=300 +cube16
> >
> > the segfault disappears and the simulation proceeds successfully, for at
> > least a step or two.
>
> Is the same true if you leave Multigrid configured in, but only disable
> it at runtime? (I think useGravity=.FALSE. and/or updateGravity=.FALSE.
> do this.)
>

The segfault persists with both useGravity=.false. and updateGravity=.false.
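
For completeness, the runtime-disable test kept Multigrid configured in at
setup time and set, in flash.par:

    useGravity    = .FALSE.
    updateGravity = .FALSE.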