[FLASH-USERS] Problems running at higher levels of refinement.

Alexander Sheardown A.Sheardown at 2011.hull.ac.uk
Mon Apr 24 09:05:14 EDT 2017


Hello Joshua,

Thanks for the reply.

We appear to have a workaround for the MPI forking problem: reducing the number of CPUs per node has solved that issue so far. The remaining problem is that the simulation still stops at some point and gives this error in the output file:


--------------------------------------------------------------------------

mpirun noticed that process rank 355 with PID 0 on node c069 exited on signal 11 (Segmentation fault).

--------------------------------------------------------------------------

with the error file showing:


Program received signal SIGSEGV: Segmentation fault - invalid memory reference.


Backtrace for this error:

#0  0x7F57EC0A7467

#1  0x7F57EC0A7AAE

#2  0x7F57EB39224F

#3  0x4595CD in amr_1blk_cc_cp_remote_ at amr_1blk_cc_cp_remote.F90:356

#4  0x47F686 in amr_1blk_guardcell_srl_ at amr_1blk_guardcell_srl.F90:753

#5  0x59DA10 in amr_1blk_guardcell_ at mpi_amr_1blk_guardcell.F90:743

#6  0x5F7273 in amr_guardcell_ at mpi_amr_guardcell.F90:301

#7  0x41CC9A in grid_fillguardcells_ at Grid_fillGuardCells.F90:460

#8  0x56FC61 in hy_uhd_unsplit_ at hy_uhd_unsplit.F90:296

#9  0x437645 in hydro_ at Hydro.F90:67

#10  0x409FC1 in driver_evolveflash_ at Driver_evolveFlash.F90:290

#11  0x404CB6 in flash at Flash.F90:51

#12  0x7F57EB37EB34


There seems to be some issue transferring guard cell information between blocks. I had come across http://flash.uchicago.edu/pipermail/flash-users/2015-February/001637.html, which describes an error with gcc 4.9.1 when filling guard cells (although I am now using openmpi/gcc/1.10.5) and gives a workaround of adding a few lines to the end of the Makefile, which I have done; this is what I have been using. Interestingly, if I don't add those lines to the end of the Makefile, the simulation stops straight away after producing the initial plot file. However, if I do include them, the simulation runs a lot further but eventually stops with the error shown above.


Have you seen a related issue like this in FLASH before, with problems transferring guard cell information?


Cheers,

Alex

________________________________
Mr Alex Sheardown
Postgraduate Research Student

E.A. Milne Centre for Astrophysics
University of Hull
Cottingham Road
Kingston upon Hull
HU6 7RX

www.milne.hull.ac.uk
________________________________
From: Joshua Wall [joshua.e.wall at gmail.com]
Sent: 19 April 2017 22:00
To: Alexander Sheardown; flash-users at flash.uchicago.edu
Subject: Re: [FLASH-USERS] Problems running at higher levels of refinement.

Hello Alex,

This is a strange error if you are using only native FLASH. I currently control FLASH from Python, which forks to create the threads that FLASH runs under, but that is deliberately done at the beginning of a run; it shouldn't occur during a run (unless you have made processes/threads to handle your N-body part). If you are spawning processes during the run, you can safely turn off the fork warning (which is what I do in my runs) by passing an MCA parameter to mpirun, as detailed at https://www.open-mpi.org/faq/?category=tuning#setting-mca-params, which should look something like:

mpirun --mca mpi_warn_on_fork 0 -np 96 ./flash4
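
The same parameter can also be set in the environment before launching (the same FAQ page covers this), which may be more convenient inside a batch script; assuming a bash-style job script, the equivalent would be something like:

export OMPI_MCA_mpi_warn_on_fork=0
mpirun -np 96 ./flash4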

Otherwise, I'd make sure FLASH is compiled with debugging symbols, and then either look at the core dump file with gdb or attach gdb to a running process to investigate what MPI is doing at the moment it tries to fork. Some helpful links for doing this:

Turn on core dumps: http://stackoverflow.com/questions/17965/how-to-generate-a-core-dump-in-linux-when-a-process-gets-a-segmentation-fault

Use gdb with Open-MPI by attaching to running processes: https://www.open-mpi.org/faq/?category=debugging#serial-debuggers
also: http://stackoverflow.com/questions/329259/how-do-i-debug-an-mpi-program
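
In practice that boils down to something like the following on the compute node (just a rough sketch; the binary name, core file name, and PID will differ on your system, and core files may need to be enabled by your cluster policy):

ulimit -c unlimited    # allow core files to be written; set before launching the job
gdb ./flash4 core      # after a crash, load the core file and type 'bt' for a backtrace
gdb -p <PID>           # or attach to a live flash4 rank and inspect it directly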

Debugging MPI programs is a bit of a black art and a bit different from usual debugging. It helps to have a lot of patience (and sometimes a plush toy to toss!). Best of luck.

Cordially,

Joshua Wall

On Tue, Apr 11, 2017 at 3:03 PM Alexander Sheardown <A.Sheardown at 2011.hull.ac.uk> wrote:
Hello Everyone,

I am running N-body + hydro galaxy cluster merger simulations, but I am running into problems at higher levels of refinement.

My simulation has a box size of 8 Mpc x 8 Mpc, contains 2 million particles, and refines on density. If I run the simulation at a maximum refinement level of 6, it runs fine and completes. However, if I turn the maximum refinement level up to 7 or 8, the simulation only gets so far (this varies; it doesn't stop at the same point every time) and exits with this MPI error in the output file:


--------------------------------------------------------------------------

An MPI process has executed an operation involving a call to the

"fork()" system call to create a child process.  Open MPI is currently

operating in a condition that could result in memory corruption or

other system errors; your MPI job may hang, crash, or produce silent

data corruption.  The use of fork() (or system() or other calls that

create child processes) is strongly discouraged.


The process that invoked fork was:


  Local host:          c127 (PID 108285)

  MPI_COMM_WORLD rank: 414


If you are *absolutely sure* that your application will successfully

and correctly survive a call to fork(), you may disable this warning

by setting the mpi_warn_on_fork MCA parameter to 0.

--------------------------------------------------------------------------

--------------------------------------------------------------------------

mpirun noticed that process rank 429 with PID 0 on node c128 exited on signal 11 (Segmentation fault).

--------------------------------------------------------------------------


..and the error file shows:

Backtrace for this error:

#0  0x7F073AAD9417

#1  0x7F073AAD9A2E

#2  0x7F0739DC124F

#3  0x454665 in amr_1blk_cc_cp_remote_ at amr_1blk_cc_cp_remote.F90:356

#4  0x4759AE in amr_1blk_guardcell_srl_ at amr_1blk_guardcell_srl.F90:370

#5  0x582550 in amr_1blk_guardcell_ at mpi_amr_1blk_guardcell.F90:743

#6  0x5DB143 in amr_guardcell_ at mpi_amr_guardcell.F90:299

#7  0x41BFDA in grid_fillguardcells_ at Grid_fillGuardCells.F90:456

#8  0x5569A3 in hy_ppm_sweep_ at hy_ppm_sweep.F90:229

#9  0x430A3A in hydro_ at Hydro.F90:87

#10  0x409904 in driver_evolveflash_ at Driver_evolveFlash.F90:275

#11  0x404B16 in flash at Flash.F90:51

#12  0x7F0739DADB34


Since this looked like a memory issue, I doubled the number of nodes I am running on, but the simulation then fails straight away with this in the output file (nothing appears in the error file):


--------------------------------------------------------------------------

mpirun noticed that process rank 980 with PID 0 on node c096 exited on signal 9 (Killed).

--------------------------------------------------------------------------



In terms of the simulation itself, the physics in the output data I can get looks fine, so I can't decide whether this is a problem with my simulation or with the MPI I am using.


Are there any parameters I could include in the simulation that would print out, say, the number of particles per processor at a given time, or any other particle-related diagnostics? One thing I am wondering is whether too many particles are landing on a single processor, or something along those lines.

For information, in case anyone has had related MPI problems with FLASH, the modules I am using are:
hdf5/gcc/openmpi/1.8.16

openmpi/gcc/1.10.5

I would greatly appreciate any thoughts or opinions on what could cause it to fail at higher levels of refinement.

Many Thanks,
Alex

________________________________
Mr Alex Sheardown
Postgraduate Research Student

E.A. Milne Centre for Astrophysics
University of Hull
Cottingham Road
Kingston upon Hull
HU6 7RX

www.milne.hull.ac.uk
--
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104

