[FLASH-USERS] Buffer issue on Pleiades

Joshua Wall joshua.e.wall at gmail.com
Mon Mar 7 18:07:30 EST 2016


I should further note that NASA just emailed me and indicated that they
could not find anything wrong on the MPI or communications side when they
looked at the log files (OpenMPI appears to be doing what it should).

Josh

On Mon, Mar 7, 2016 at 3:33 PM Joshua Wall <joshua.e.wall at gmail.com> wrote:

> Dear users:
>
>     I've been doing some testing on Pleiades and almost have everything
> working. I got the following error today while running a test on 150
> processors, and I'm trying to pin down what the problem is. The run gets
> through a number of steps before I see:
>
> *** Wrote plotfile to /nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
> **** WARNING: globalNumParticles = 0!!!
> *** Wrote particle file to
> /nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010 ****
> iteration, no. not moved = 0 7618
> iteration, no. not moved = 1 1138
> iteration, no. not moved = 2 0
> refined: total leaf blocks = 8891 refined: total blocks = 10161
> Paramesh error : pe 18 needed full blk 65 17 but could not find it or only
> found part of it in the message buffer. Contact PARAMESH developers for
> help.
>
> The last bit of the log file wasn't too helpful either:
>
> [ 03-07-2016 11:43:22.005 ] [gr_hgSolve]: gr_hgSolve: ite 0:
> norm(residual)/norm(src) = 5.794046E-04
> [ 03-07-2016 11:43:22.006 ] [mpi_amr_comm_setup]: buffer_dim_send=42989,
> buffer_dim_recv=33593
> [ 03-07-2016 11:43:22.081 ] [mpi_amr_comm_setup]: buffer_dim_send=36341,
> buffer_dim_recv=31453
> [ 03-07-2016 11:43:22.128 ] [mpi_amr_comm_setup]: buffer_dim_send=29693,
> buffer_dim_recv=29313
> [ 03-07-2016 11:43:22.274 ] [gr_hgSolve]: gr_hgSolve: ite 1:
> norm(residual)/norm(src) = 5.835671E-05
> [ 03-07-2016 11:43:22.275 ] [mpi_amr_comm_setup]: buffer_dim_send=42989,
> buffer_dim_recv=33593
> [ 03-07-2016 11:43:22.351 ] [mpi_amr_comm_setup]: buffer_dim_send=36341,
> buffer_dim_recv=31453
> [ 03-07-2016 11:43:22.394 ] [mpi_amr_comm_setup]: buffer_dim_send=29693,
> buffer_dim_recv=29313
> [ 03-07-2016 11:43:22.540 ] [gr_hgSolve]: gr_hgSolve: ite 2:
> norm(residual)/norm(src) = 2.329683E-06
> [ 03-07-2016 11:43:22.541 ] [mpi_amr_comm_setup]: buffer_dim_send=42989,
> buffer_dim_recv=33593
> [ 03-07-2016 11:43:22.618 ] [mpi_amr_comm_setup]: buffer_dim_send=36341,
> buffer_dim_recv=31453
> [ 03-07-2016 11:43:22.661 ] [mpi_amr_comm_setup]: buffer_dim_send=29693,
> buffer_dim_recv=29313
> [ 03-07-2016 11:43:22.805 ] [gr_hgSolve]: gr_hgSolve: ite 3:
> norm(residual)/norm(src) = 1.244917E-07
> [ 03-07-2016 11:43:23.283 ] [IO_writePlotfile] open: type=plotfile
> name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
> [ 03-07-2016 11:43:25.511 ] [IO_writePlotfile] close: type=plotfile
> name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
> [ 03-07-2016 11:43:25.514 ] [IO_writeParticles] open: type=particles
> name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
> [ 03-07-2016 11:43:25.514 ] [IO_writeParticles]: done called
> Particles_updateAttributes()
> [ 03-07-2016 11:43:25.529 ] [IO_writeParticles] close: type=particles
> name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
> [ 03-07-2016 11:43:25.578 ] [mpi_amr_comm_setup]: buffer_dim_send=723197,
> buffer_dim_recv=704385
> [ 03-07-2016 11:43:25.907 ] [GRID amr_refine_derefine]: initiating
> refinement
> [ 03-07-2016 11:43:25.917 ] [GRID amr_refine_derefine]: redist. phase. tot
> blks requested: 10161
> [GRID amr_refine_derefine] min blks 67 max blks 69 tot blks 10161
> [GRID amr_refine_derefine] min leaf blks 58 max leaf blks 60 tot leaf blks
> 8891
> [ 03-07-2016 11:43:25.929 ] [GRID amr_refine_derefine]: refinement complete
>
> I'm guessing the issue is that the buffer wasn't able to pass a big enough
> message? Has anyone seen this error before on a large cluster? I'm
> wondering if there is a preset buffer limit somewhere on NASA's
> interconnect. I've run this test code before on another cluster with the
> same number of processors (150), using an OpenMPI 1.10.2 that I built
> myself without a problem (which is why I picked it to test with). I've
> also been sorting through an issue where communication between nodes
> worked, but communication between islands of nodes did not. I think I
> fixed that by adding the MCA parameter --mca oob_tcp_if_include ib0, but
> it seems worth mentioning.
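>
> (For reference, the launch line I'm describing would look roughly like
> the following; the executable name here is just a placeholder rather than
> my actual binary path:)
>
> mpirun --mca oob_tcp_if_include ib0 -np 150 ./flash4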
>
> I'm using a "home built" version of OpenMPI 1.10.2 with NASA's recommended
> settings:
>
> ./configure --with-tm=/PBS
> --with-verbs=/usr --enable-mca-no-build=maffinity-libnuma
> --with-cuda=/nasa/cuda/7.0 --enable-mpi-interface-warning --without-slurm
> --without-loadleveler --enable-mpirun-prefix-by-default
> --enable-btl-openib-failover --prefix=/u/jewall/ompi-1.10.2
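>
> (As a quick sanity check that those flags took effect, ompi_info should
> list the openib BTL and the tm launcher; the grep patterns below are just
> illustrative, and the exact output format varies a bit between versions:)
>
> ompi_info | grep -i "btl: openib"
> ompi_info | grep -i "plm: tm"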
>
> Any ideas are very welcome, even if they just help me ask the right
> question of the OpenMPI users group or the NASA help desk.
>
> Cordially,
>
> Josh
> --
> Joshua Wall
> Doctoral Candidate
> Department of Physics
> Drexel University
> 3141 Chestnut Street
> Philadelphia, PA 19104
>
-- 
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104