[FLASH-USERS] FLASH Particle Errors

Evan O'Connor evanoc at cita.utoronto.ca
Fri Dec 20 12:26:02 EST 2013


Hello all,

First-time poster; I hope I've given enough info, but let me know if not.

I'm doing some 2D, axisymmetric hydro simulations of core collapse
with Sean Couch. We are implementing particles, and I am running into
MPI errors and segfaults after running for some time (these have never
occurred when I don't include particles). For example, I'm running ~30
models, each with 10000 particles and 16 MPI processes (2 nodes); within
the last 24 hours about 70% have failed with particle-related errors at
various times. I have tried updating to the most recent OpenMPI version
(openmpi/1.6.5, with gcc/4.8.2) and the errors persist (I was previously
using openmpi/1.6.1 and gcc/4.7.2). I did implement my own
ParticleInitialization method, but I doubt this is the issue, as the
particles are initialized and evolve fine early on. I figured I would
mention it just in case.
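
For concreteness, the shell placement is, roughly speaking, just a
geometric distribution of the tracers over a shell of fixed radius in
our 2D cylindrical (R, z) geometry. A minimal standalone sketch of that
idea (not the actual FLASH initialization hook, and with a purely
illustrative shell radius) would look like this:

! Places npart tracer particles evenly in polar angle on a shell of
! radius r_shell, in 2D cylindrical (R, z) coordinates with R >= 0.
! In the real setup this kind of logic fills the FLASH particles array
! inside the custom INITMETHOD "shell" routine; here it is standalone.
program shell_init_sketch
  implicit none
  integer, parameter :: npart   = 10000      ! particles per model
  real,    parameter :: r_shell = 1.0e7      ! shell radius [cm], illustrative
  real,    parameter :: pi      = 3.14159265
  real    :: theta, posR(npart), posZ(npart)
  integer :: i

  do i = 1, npart
     ! theta runs from 0 (+z axis) to pi (-z axis)
     theta   = pi * (real(i) - 0.5) / real(npart)
     posR(i) = r_shell * sin(theta)           ! cylindrical radius, >= 0
     posZ(i) = r_shell * cos(theta)
  end do

  print *, 'first particle (R,z):', posR(1), posZ(1)
  print *, 'last  particle (R,z):', posR(npart), posZ(npart)
end program shell_init_sketch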

At least some of these errors seem robust, i.e. they are reproducible
from checkpoints, and to some extent reproducible even when I change
the number of processes (restarting a 16 MPI process, 2-node run from a
checkpoint on 8 MPI processes and 1 node terminates at the same time
step, though with a different error; a 4 MPI process run on an 8-core
node doesn't crash on a checkpoint restart).

I'm not sure how to go about debugging this, so I figured I would start
with the users list to solicit advice from particle experts on any easy
solutions and/or tips on the best place to start. The errors seem to be
of the types shown below. I've included some info about our setup and
how I add the particles in my configuration at the end of the email
(perhaps I am missing something there).

Thanks for any help, Happy Holidays,
Evan O'Connor 

Typical Crash Type 1: invalid rank:

[tpb218:3617] *** An error occurred in MPI_Send
[tpb218:3617] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[tpb218:3617] *** MPI_ERR_RANK: invalid rank
[tpb218:3617] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 3617 on
node tpb218 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


Typical Crash Type 2: segfault (invalid memory reference), occurring in
either grid_moveparticles or io_writeparticles

a) grid_moveparticles:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2B588B0082D7
#1  0x2B588B0088DE
#2  0x2B588C25091F
#3  0x2B588C2A7131
#4  0x621E7E in ut_sortonprocs_
#5  0x50C6C4 in gr_ptmovepttopt_
#6  0x448B17 in grid_moveparticles_
#7  0x461EF6 in particles_advance_
#8  0x431750 in driver_evolveflash_
[tpb205][[47673,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 8352 on node tpb206 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

b) io_writeparticles:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2B677C23C2D7
#1  0x2B677C23C8DE
#2  0x2B677D48491F
#3  0x5EB058 in __namevaluell_data_MOD_namevaluell_checkreal
#4  0x5ED9E4 in namevaluell_setreal_
#5  0x434329 in driver_sendoutputdata_
#6  0x45560D in io_updatescalars_
#7  0x457E4C in io_writeparticles_
#8  0x4548B1 in io_output_
#9  0x431868 in driver_evolveflash_
[tpb203][[2990,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 10525 on node tpb204 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


Lines added to the configuration file (Shell is implemented in a similar
way to LATTICE):

PARTICLETYPE passive INITMETHOD shell MAPMETHOD quadratic ADVMETHOD rungekutta
REQUIRES Particles/ParticlesMain
REQUESTS Particles/ParticlesMain/passive/RungeKutta
REQUESTS Particles/ParticlesMapping/Quadratic
REQUESTS Particles/ParticlesInitialization/Shell
REQUIRES IO/IOMain
REQUIRES IO/IOParticles
REQUIRES Grid/GridParticles

PARTICLEPROP dens REAL
PARTICLEPROP temp REAL
PARTICLEPROP ye REAL
PARTICLEPROP velx REAL
PARTICLEPROP vely REAL

The setup line is:

./setup CoreCollapse/leakage -auto -2d +cylindrical -nxb=16 -nyb=16 -objdir ccsn2dLeak threadBlockList=False +pm4dev threadWithinBlock=False +newMpole +uhdopt
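
For reference, the particle-related runtime parameters in our flash.par
are along these lines (the values shown are placeholders rather than the
exact ones from these runs):

useParticles              = .true.
pt_maxPerProc             = 2000       # max particles owned by one process
particleFileIntervalTime  = 1.0e-3     # particle file output cadence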

