[FLASH-USERS] Buffer issue on Pleiades

Joshua Wall joshua.e.wall at gmail.com
Mon Mar 7 15:34:33 EST 2016


Dear users:

    I've been doing some testing on Pleiades and almost have everything
working. I got the following error today while running a test on 150
processors, and I'm trying to make sure I understand what the problem is.
The run gets through a number of steps before I hit:

*** Wrote plotfile to /nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
**** WARNING: globalNumParticles = 0!!!
*** Wrote particle file to /nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010 ****
iteration, no. not moved = 0 7618
iteration, no. not moved = 1 1138
iteration, no. not moved = 2 0
refined: total leaf blocks = 8891 refined: total blocks = 10161
Paramesh error : pe 18 needed full blk 65 17 but could not find it or only found part of it in the message buffer. Contact PARAMESH developers for help.

The last bit of the log file wasn't too helpful either:

[ 03-07-2016 11:43:22.005 ] [gr_hgSolve]: gr_hgSolve: ite 0: norm(residual)/norm(src) = 5.794046E-04
[ 03-07-2016 11:43:22.006 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
[ 03-07-2016 11:43:22.081 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
[ 03-07-2016 11:43:22.128 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
[ 03-07-2016 11:43:22.274 ] [gr_hgSolve]: gr_hgSolve: ite 1: norm(residual)/norm(src) = 5.835671E-05
[ 03-07-2016 11:43:22.275 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
[ 03-07-2016 11:43:22.351 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
[ 03-07-2016 11:43:22.394 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
[ 03-07-2016 11:43:22.540 ] [gr_hgSolve]: gr_hgSolve: ite 2: norm(residual)/norm(src) = 2.329683E-06
[ 03-07-2016 11:43:22.541 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
[ 03-07-2016 11:43:22.618 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
[ 03-07-2016 11:43:22.661 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
[ 03-07-2016 11:43:22.805 ] [gr_hgSolve]: gr_hgSolve: ite 3: norm(residual)/norm(src) = 1.244917E-07
[ 03-07-2016 11:43:23.283 ] [IO_writePlotfile] open: type=plotfile name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
[ 03-07-2016 11:43:25.511 ] [IO_writePlotfile] close: type=plotfile name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
[ 03-07-2016 11:43:25.514 ] [IO_writeParticles] open: type=particles name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
[ 03-07-2016 11:43:25.514 ] [IO_writeParticles]: done called Particles_updateAttributes()
[ 03-07-2016 11:43:25.529 ] [IO_writeParticles] close: type=particles name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
[ 03-07-2016 11:43:25.578 ] [mpi_amr_comm_setup]: buffer_dim_send=723197, buffer_dim_recv=704385
[ 03-07-2016 11:43:25.907 ] [GRID amr_refine_derefine]: initiating refinement
[ 03-07-2016 11:43:25.917 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 10161
[GRID amr_refine_derefine] min blks 67 max blks 69 tot blks 10161
[GRID amr_refine_derefine] min leaf blks 58 max leaf blks 60 tot leaf blks 8891
[ 03-07-2016 11:43:25.929 ] [GRID amr_refine_derefine]: refinement complete

I'm guessing the issue is that the buffer couldn't hold a large enough
message? Has anyone run into this error before on a large cluster? I'm
wondering if there is somehow a preset message-size limit on NASA's
interconnect. I've run this same test code before, on another cluster with
the same number of processors (150), using an OpenMPI 1.10.2 build of my
own, without any problem (which is why I picked it for this test). I've
also been sorting out an issue where communication between nodes worked,
but communication between islands of nodes did not. I think I fixed that
by adding the MCA parameter --mca oob_tcp_if_include ib0, but it seems
worth mentioning.
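
To narrow this down, I've been thinking a minimal point-to-point test,
independent of FLASH, might show whether a message the size of PARAMESH's
buffers can actually make it between islands. A rough sketch of what I have
in mind (untested on Pleiades; the count of 723197 doubles is just borrowed
from the largest buffer_dim_send in the log above, and the real PARAMESH
buffers aren't necessarily doubles):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Send a PARAMESH-buffer-sized array from rank 0 to the last rank,
 * which stands a decent chance of sitting on a different island. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int n = 723197;  /* largest buffer_dim_send reported above */
    double *buf = malloc(n * sizeof(double));

    if (rank == 0) {
        for (int i = 0; i < n; i++) buf[i] = (double)i;
        MPI_Send(buf, n, MPI_DOUBLE, size - 1, 0, MPI_COMM_WORLD);
        printf("rank 0 sent %d doubles\n", n);
    } else if (rank == size - 1) {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d received %d doubles, last = %g\n",
               rank, n, buf[n - 1]);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

I'd compile it with the same mpicc and launch it the same way as FLASH
(e.g. mpirun --mca oob_tcp_if_include ib0 -np 150 ./bigmsg) on a node
allocation that spans islands. If that fails while an intra-island run
succeeds, at least I'd have a much more concrete question for the OpenMPI
list.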

I'm using a "home built" version of OpenMPI 1.10.2 with NASA's recommended
settings:

./configure --with-tm=/PBS \
    --with-verbs=/usr --enable-mca-no-build=maffinity-libnuma \
    --with-cuda=/nasa/cuda/7.0 --enable-mpi-interface-warning --without-slurm \
    --without-loadleveler --enable-mpirun-prefix-by-default \
    --enable-btl-openib-failover --prefix=/u/jewall/ompi-1.10.2

Any ideas are very welcome, even if they just help me ask the right
question of the OpenMPI users list or the NASA help desk.

Cordially,

Josh
-- 
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104