[FLASH-USERS] Problems running at higher levels of refinement.

Joshua Wall joshua.e.wall at gmail.com
Wed Apr 19 17:00:35 EDT 2017


Hello Alex,

This is a strange error if you are using only native Flash. I currently
control Flash from Python by forking the processes that Flash runs under,
but I do this on purpose at the beginning of a run. A fork shouldn't
occur during a run (unless you have created processes/threads to handle
your N-body). If you are spawning processes during the run, you can safely
turn off the fork warning (which is what I do in my runs) by setting the
MCA parameter on the mpirun command line, as detailed here:
https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
The call should look something like:

mpirun --mca mpi_warn_on_fork 0 -np 96 ./flash4
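
Open MPI also reads MCA parameters from environment variables named
OMPI_MCA_<parameter>, so the same setting can be applied without touching
the command line. A minimal sketch for a Python launcher follows; the
function name build_flash_launch is hypothetical, and the rank count and
executable are simply the ones from the command above. Either the flag or
the environment variable is enough on its own.

```python
import os
import shlex


def build_flash_launch(n_ranks, exe="./flash4", warn_on_fork=False):
    """Return (argv, env) for launching Flash under Open MPI's mpirun.

    Hypothetical helper: the fork warning is switched off both on the
    command line and through the environment so either route can be used.
    """
    flag = "1" if warn_on_fork else "0"
    argv = ["mpirun", "--mca", "mpi_warn_on_fork", flag,
            "-np", str(n_ranks), exe]
    # Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
    env = dict(os.environ)
    env["OMPI_MCA_mpi_warn_on_fork"] = flag
    return argv, env


if __name__ == "__main__":
    argv, env = build_flash_launch(96)
    # prints: mpirun --mca mpi_warn_on_fork 0 -np 96 ./flash4
    print(shlex.join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```

Note that this only silences the warning; it does not address the
segmentation fault itself.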

Otherwise, I'd make sure Flash is compiled with debugging symbols so that
gdb output is useful, and then either examine the core dump file with gdb
or attach gdb to a running process to see what MPI is doing at the moment
it tries to fork. Some helpful links for doing this:

Turn on core dumps:
http://stackoverflow.com/questions/17965/how-to-generate-a-core-dump-in-linux-when-a-process-gets-a-segmentation-fault
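
If the job is launched from a Python script, the core-file limit can also
be raised from the script before mpirun is started, since child processes
inherit resource limits. This is a minimal sketch using the standard
resource module; the function name enable_core_dumps is hypothetical, and
whether ranks on remote nodes inherit the limit depends on how the cluster
and scheduler are set up, so treat that part as an assumption to verify.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    Roughly what `ulimit -c unlimited` does in a shell when the hard
    limit is unlimited. Returns the resulting (soft, hard) pair.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Call enable_core_dumps() once, before the subprocess that runs mpirun is
created.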

Use gdb with Open-MPI by attaching to running processes:
https://www.open-mpi.org/faq/?category=debugging#serial-debuggers
also:
http://stackoverflow.com/questions/329259/how-do-i-debug-an-mpi-program
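
As a rough sketch of the attach step, the helper below lists the local
processes named flash4 and builds one "gdb -p <PID>" command per process.
The function names are hypothetical, it assumes the pgrep utility is
installed, and it has to be run on the compute node where the ranks are
actually running.

```python
import subprocess


def find_pids(name="flash4"):
    """Return PIDs of local processes whose name is exactly `name`."""
    result = subprocess.run(["pgrep", "-x", name],
                            capture_output=True, text=True)
    # pgrep exits non-zero and prints nothing when there is no match.
    return [int(tok) for tok in result.stdout.split()]


def gdb_attach_commands(pids):
    """Build one gdb attach command line per PID."""
    return [f"gdb -p {pid}" for pid in pids]


# Example use on a compute node:
#   for cmd in gdb_attach_commands(find_pids()):
#       print(cmd)
```

Each printed command can then be run in its own terminal to attach to one
rank.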

Debugging MPI programs is a bit of a black art and a bit different from
usual debugging. It helps to have a lot of patience (and sometimes a plush
toy to toss!). Best of luck.

Cordially,

Joshua Wall

On Tue, Apr 11, 2017 at 3:03 PM Alexander Sheardown <
A.Sheardown at 2011.hull.ac.uk> wrote:

> Hello Everyone,
>
> I am running N-Body + Hydro galaxy cluster merger simulations but I am
> running into problems when trying to run with higher levels of refinement.
>
> My simulation has a box size 8 Mpc x 8 Mpc and contains 2 million
> particles and is refining on density. If I run the simulation on a
> maximum refinement level of 6, the simulation runs fine and completes its
> run. However if I turn the max refine level up to 7 or 8, the simulation
> only gets so far (this varies, it doesn't stop at the same point every time)
> and exits with the MPI error in the output file:
>
> --------------------------------------------------------------------------
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process.  Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption.  The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
>   Local host:          c127 (PID 108285)
>   MPI_COMM_WORLD rank: 414
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 429 with PID 0 on node c128 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> ..and the error file shows:
>
> Backtrace for this error:
> #0  0x7F073AAD9417
> #1  0x7F073AAD9A2E
> #2  0x7F0739DC124F
> #3  0x454665 in amr_1blk_cc_cp_remote_ at amr_1blk_cc_cp_remote.F90:356
> #4  0x4759AE in amr_1blk_guardcell_srl_ at amr_1blk_guardcell_srl.F90:370
> #5  0x582550 in amr_1blk_guardcell_ at mpi_amr_1blk_guardcell.F90:743
> #6  0x5DB143 in amr_guardcell_ at mpi_amr_guardcell.F90:299
> #7  0x41BFDA in grid_fillguardcells_ at Grid_fillGuardCells.F90:456
> #8  0x5569A3 in hy_ppm_sweep_ at hy_ppm_sweep.F90:229
> #9  0x430A3A in hydro_ at Hydro.F90:87
> #10  0x409904 in driver_evolveflash_ at Driver_evolveFlash.F90:275
> #11  0x404B16 in flash at Flash.F90:51
> #12  0x7F0739DADB34
>
>
> Since this showed a memory issue I doubled the number of nodes I am
> running on but the simulation fails straight away with this in the output
> file (nothing appears in the error file):
>
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 980 with PID 0 on node c096 exited on
> signal 9 (Killed).
> --------------------------------------------------------------------------
>
>
>
> In terms of the simulation itself, looking at the output data that I can
> get out everything looks fine in terms of the physics, so I can't decide
> whether this is a problem with my simulation or the MPI I am using.
>
>
> Are there any parameters I could include in the simulation that would
> print out, say, the number of particles per processor at a given time, or
> any other particle diagnostics? One thought is that too many particles
> may be landing on a single processor, or something related.
>
> For info if anyone has had related MPI problems with FLASH the modules I
> am using are:
> hdf5/gcc/openmpi/1.8.16
>
> openmpi/gcc/1.10.5
>
> I would greatly appreciate any thoughts or opinions on what could cause it
> to fail with higher levels of refinement.
>
> Many Thanks,
> Alex
>
> ------------------------------
> *Mr Alex Sheardown*
> Postgraduate Research Student
>
> E.A. Milne Centre for Astrophysics
> University of Hull
> Cottingham Road
> Kingston upon Hull
> HU6 7RX
>
> www.milne.hull.ac.uk

-- 
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104