[FLASH-USERS] amr_1blk_cc_cp_remote segfault during GC filling

Aaron Tran aaron.tran at columbia.edu
Tue May 21 10:29:37 EDT 2019


Hi flash-users,

I'm seeing a segfault during guard cell filling (comments + backtrace
below).  Would anyone have insight into this problem, or similar errors to
share?  The closest prior report I could find was:
http://flash.uchicago.edu/pipermail/flash-users/2017-April/002259.html
and I didn't see a resolution after that thread.

Many thanks,
Aaron

The segfault occurs while amr_1blk_cc_cp_remote(...) is copying unk
variables from temprecv_buf to local guard cells.  As the copy operation
loops over individual cells, the index into temprecv_buf (set by
amr_mpi_find_blk_in_buffer(...), mpi_set_message_limits(...), and
ngcell_on_cc) goes out of bounds, which triggers the segfault.
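
To make the failure mode concrete, here is a toy Python sketch.  It is not
PARAMESH code: the function name, the flat buffer, and the cell-major
packing (n_vars values per cell) are all assumptions for illustration.  It
only shows how a starting offset that is slightly wrong lets a per-cell
copy loop walk off the end of a receive buffer.

```python
def first_out_of_bounds(buf_len, start, n_cells, n_vars):
    """Return the first flat index that falls outside the buffer, or None.

    Toy model only: `start` stands in for the offset that a lookup such as
    amr_mpi_find_blk_in_buffer would return, and the cell-major packing
    (n_vars values per cell) is an assumption, not PARAMESH's real layout.
    """
    index = start
    for _cell in range(n_cells):
        for _var in range(n_vars):
            # A copy loop without this check would read past the buffer.
            if not 0 <= index < buf_len:
                return index
            index += 1
    return None


# A 16-cell, 2-variable message fits when it starts at offset 0 ...
print(first_out_of_bounds(buf_len=32, start=0, n_cells=16, n_vars=2))
# ... but overruns the same buffer if the starting offset is off by 4.
print(first_out_of_bounds(buf_len=32, start=4, n_cells=16, n_vars=2))
```

In the real code the offset and message extents come from several routines,
so a mismatch between any of them could plausibly produce the same symptom.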

The segfault occurs...
* with -O0 for multiple MPI implementations + gcc versions
* for a specific nproc (20); restarting with nproc=19 or 21 bypasses it
* after the sim has already run for ~hundreds of steps

The segfault can be reproduced by...
1. modifying nproc to bypass segfault,
2. dumping checkpoint immediately after point where segfault would occur,
3. restarting from checkpoint with newly untarred FLASH4.6 and a stub
simulation; the setup call is

    ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \
        -with-unit=physics/Gravity/GravityMain/Poisson/Multigrid \
        -maxblocks=300 +cube16

Lastly, if Multigrid is omitted:

    ./setup GCCrash -a -3d +usm +supportPPMupwind +pm4dev_fixed \
        -maxblocks=300 +cube16

the segfault disappears and the simulation proceeds successfully, for at
least a step or two.

From all that, it seems like a specific proc/block layout used with
Multigrid may be responsible, but I've not investigated any further.  I'm
happy to share the stub simulation code + checkpoint file if anyone wishes
to take a look.

The "original" backtrace (modified version of FLASH 4.5) occurs shortly
after a refinement:

    #0  0x2aaaaca3424f in ???
    #1  0x4717dd in amr_1blk_cc_cp_remote_ at amr_1blk_cc_cp_remote.F90:356
    #2  0x492e0e in amr_1blk_guardcell_srl_ at amr_1blk_guardcell_srl.F90:510
    #3  0x58269e in amr_1blk_guardcell_ at mpi_amr_1blk_guardcell.F90:743
    #4  0x5c01e4 in amr_guardcell_ at mpi_amr_guardcell.F90:301
    #5  0x41c236 in grid_fillguardcells_ at Grid_fillGuardCells.F90:460
    #6  0x556e9b in hy_uhd_unsplit_ at hy_uhd_unsplit.F90:253
    #7  0x433b85 in hydro_ at Hydro.F90:67
    #8  0x40ae27 in driver_evolveflash_ at Driver_evolveFlash.F90:290
    #9  0x404f06 in flash at Flash.F90:51
    #10  0x404f06 in main at Flash.F90:43

The backtrace (FLASH 4.6) from a stub simulation restart is:

    #0  0x2aaaac75e24f in ???
    #1  0x46e8b5 in amr_1blk_cc_cp_remote_ at amr_1blk_cc_cp_remote.F90:356
    #2  0x496688 in amr_1blk_guardcell_srl_ at amr_1blk_guardcell_srl.F90:510
    #3  0x63b5a0 in amr_1blk_guardcell_ at mpi_amr_1blk_guardcell.F90:743
    #4  0x684214 in amr_guardcell_ at mpi_amr_guardcell.F90:301
    #5  0x42298b in grid_fillguardcells_ at Grid_fillGuardCells.F90:460
    #6  0x433ab8 in grid_initdomain_ at Grid_initDomain.F90:169
    #7  0x40e915 in driver_initflash_ at Driver_initFlash.F90:186
    #8  0x417128 in flash at Flash.F90:49
    #9  0x417169 in main at Flash.F90:43