[FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling

Klaus Weide klaus at flash.uchicago.edu
Tue May 21 22:36:21 EDT 2019


On Tue, 21 May 2019, Aaron Tran wrote:

> Hi flash-users,
> 
> I'm seeing a segfault during guard cell filling (comments + backtrace
> below).  Would anyone have insight into this problem, or similar errors to
> share?  

 
> The segfault occurs while amr_1blk_cc_cp_remote(...) is copying unk
> variables from temprecv_buf to local guard cells.  As the copy operation
> loops over individual cells, the index into temprecv_buf (set by
> amr_mpi_find_blk_in_buffer(...), mpi_set_message_limits(...), and
> ngcell_on_cc) goes out of bounds and segfaults.
> 
> The segfault occurs...
> * with -O0 for multiple MPI implementations + gcc versions
> * for a specific nproc=20, can bypass by restarting with nproc=19 or 21
> * after the sim has already run for ~hundreds of steps
> 
> The segfault can be reproduced by...
> 1. modifying nproc to bypass segfault,
> 2. dumping checkpoint immediately after point where segfault would occur,
> 3. restarting from checkpoint with newly untarred FLASH4.6 and a stub
> simulation; the setup call is
> 
>     ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \
>         -with-unit=physics/Gravity/GravityMain/Poisson/Multigrid \
>         -maxblocks=300 +cube16
> 

Hi Aaron,

Thank you for providing a problem report with details, and also for 
checking the mailing list archives first!

I don't have a good idea of what exactly might be causing your problem,
but here are some observations that may help narrow down the possible
causes.

You are using an unusual combination of features in your configuration,
so it is not so surprising that you are encountering problems. I was
actually a bit surprised that your configuration usually works - I
believe we may never have tested this combination.

More specifically,
 * +pm4dev_fixed - probably not necessary any more since just +pm4dev
   does the same thing these days. This may be a leftover from several
   versions back, when the default configuration of the PARAMESH Grid
   was different.
 * -maxblocks=300 with blocks of size 16x16x16 (+guard cells) may be
   quite a lot; are you sure you are not running into memory problems?
 * +supportPPMupwind is unusual, in that it requires 6 layers of guard
   cells, rather than the usual default of 4.
   - Is this increased number represented in Flash.h (as NGUARD)? Are
     there any places that somehow still assume or imply that nguard=4?
     (maybe runtime parameters?) A quick check is sketched right after
     this list.
 * The Multigrid implementation in particular may not have been tested 
   much with NGUARD different from the default.
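
For the NGUARD question, a quick check would be something like the
following (assuming the setup output directory has the default name
'object'; adjust if yours is named differently):

    grep -n NGUARD object/Flash.h
    # with +supportPPMupwind this should report something like
    #   #define NGUARD 6
    # rather than the ordinary default
    #   #define NGUARD 4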

All this still doesn't explain why occasionally one proc's temprecv_buf
would have too little space.

Some things you may want to try; they *could* help, but I don't
think it is very likely (example snippets for the first two items
follow the list):
 * In the file
   source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config
   comment out the line
    PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST
 * Make sure you have runtime parameter enableMaskedGCFill=FALSE
 * Try a different size (probably a larger factor than the default of 10)
   for maxblocks_tr and maxblocks_alloc; this may need changes in
   both paramesh_dimensions.F90 and amr_initialize.F90.
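
To spell out the first two of these (just sketches; please double-check
the exact parameter spelling against the runtime-parameter documentation
that setup generates):

    # in source/Grid/GridMain/paramesh/interpolation/Paramesh4/prolong/Config,
    # comment out the preprocessor definition:
    #PPDEFINE PM_OPTIMIZE_MORTONBND_FETCHLIST

    # in flash.par:
    enableMaskedGCFill = .FALSE.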

Also, try configuring without +supportPPMupwind, make sure NGUARD is 4,
and see whether the problem disappears.
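
In other words, keep everything else from your setup line the same and
just drop that shortcut:

    ./setup GCCrash -a -3d +usm +pm4dev_fixed \
        -with-unit=physics/Gravity/GravityMain/Poisson/Multigrid \
        -maxblocks=300 +cube16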

> Lastly, if Multigrid is omitted:
> 
>     ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \
>         -maxblocks=300 +cube16
> 
> the segfault disappears and the simulation proceeds successfully, for at
> least a step or two.

Is the same true if you leave Multigrid configured in, but only disable
it at runtime? (I think useGravity=.FALSE. and/or updateGravity=.FALSE.
do this.)
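
For example, in flash.par (as I said, I am not sure offhand which of
the two parameters is the relevant one, so perhaps set both):

    useGravity    = .FALSE.
    updateGravity = .FALSE.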


Klaus


