<div dir="ltr"><div><div><div><div><div>Dear users:<br><br></div> I've been doing some testing on Pleiades, and almost have everything working. I got the following error today while running a test on 150 processors, and am trying to make sure I know what the problem here is. It runs through a number of steps before I get:<br><br> *** Wrote plotfile to /nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010 <br>****
WARNING: globalNumParticles = 0!!!
 *** Wrote particle file to /nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010 ****
  iteration, no. not moved = 0 7618
  iteration, no. not moved = 1 1138
  iteration, no. not moved = 2 0
  refined: total leaf blocks = 8891
  refined: total blocks = 10161
 Paramesh error : pe 18 needed full blk 65 17 but could not find it or only found part of it in the message buffer. Contact PARAMESH developers for help.
<br><br></div><div>the last bit of the log file wasn't too helpful either:<br></div><div><br> [ 03-07-2016 11:43:22.005 ] [gr_hgSolve]: gr_hgSolve: ite 0: norm(residual)/norm(src) = 5.794046E-04
 [ 03-07-2016 11:43:22.006 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
 [ 03-07-2016 11:43:22.081 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
 [ 03-07-2016 11:43:22.128 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
 [ 03-07-2016 11:43:22.274 ] [gr_hgSolve]: gr_hgSolve: ite 1: norm(residual)/norm(src) = 5.835671E-05
 [ 03-07-2016 11:43:22.275 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
 [ 03-07-2016 11:43:22.351 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
 [ 03-07-2016 11:43:22.394 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
 [ 03-07-2016 11:43:22.540 ] [gr_hgSolve]: gr_hgSolve: ite 2: norm(residual)/norm(src) = 2.329683E-06
 [ 03-07-2016 11:43:22.541 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
 [ 03-07-2016 11:43:22.618 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
 [ 03-07-2016 11:43:22.661 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
 [ 03-07-2016 11:43:22.805 ] [gr_hgSolve]: gr_hgSolve: ite 3: norm(residual)/norm(src) = 1.244917E-07
 [ 03-07-2016 11:43:23.283 ] [IO_writePlotfile] open: type=plotfile name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
 [ 03-07-2016 11:43:25.511 ] [IO_writePlotfile] close: type=plotfile name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
 [ 03-07-2016 11:43:25.514 ] [IO_writeParticles] open: type=particles name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
 [ 03-07-2016 11:43:25.514 ] [IO_writeParticles]: done called Particles_updateAttributes()
 [ 03-07-2016 11:43:25.529 ] [IO_writeParticles] close: type=particles name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
 [ 03-07-2016 11:43:25.578 ] [mpi_amr_comm_setup]: buffer_dim_send=723197, buffer_dim_recv=704385
 [ 03-07-2016 11:43:25.907 ] [GRID amr_refine_derefine]: initiating refinement
 [ 03-07-2016 11:43:25.917 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 10161
 [GRID amr_refine_derefine] min blks 67 max blks 69 tot blks 10161
 [GRID amr_refine_derefine] min leaf blks 58 max leaf blks 60 tot leaf blks 8891
 [ 03-07-2016 11:43:25.929 ] [GRID amr_refine_derefine]: refinement complete

My guess is that the buffer couldn't hold a big enough message. Has anyone seen this error before on a large cluster? I'm wondering whether there is a pre-set buffer limit somewhere on NASA's fiber. I notice that in the last mpi_amr_comm_setup entry above, buffer_dim_send jumps to 723197 (versus roughly 30000-43000 in the earlier entries) right before the refinement that fails. I've run this test code before on another cluster with the same number of processors (150), using an OpenMPI 1.10.2 build of my own, without a problem (hence why I picked it to test with). I've also been sorting through an issue where communication between nodes worked, but communication between islands of nodes did not. I think I fixed that by adding the MCA parameter --mca oob_tcp_if_include ib0, but it feels worth mentioning.

I'm using a "home built" version of OpenMPI 1.10.2 configured with NASA's recommended settings:

    ./configure --with-tm=/PBS --with-verbs=/usr --enable-mca-no-build=maffinity-libnuma \
        --with-cuda=/nasa/cuda/7.0 --enable-mpi-interface-warning \
        --without-slurm --without-loadleveler --enable-mpirun-prefix-by-default \
        --enable-btl-openib-failover --prefix=/u/jewall/ompi-1.10.2
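In case the exact invocation matters, here is roughly how I'm launching the run and poking at the buffer-related defaults. The executable name flash4 is just a stand-in for my actual FLASH binary, and the parameter names are simply the openib size limits I know to look at:

    # restrict OpenMPI's out-of-band TCP messaging to the IPoIB interface,
    # per the fix for the island-to-island communication issue above
    mpirun --mca oob_tcp_if_include ib0 -np 150 ./flash4

    # dump the openib BTL's MCA parameters to check for message-size limits,
    # e.g. btl_openib_eager_limit and btl_openib_max_send_size
    ompi_info --param btl openib --level 9

If the limit is on the OpenMPI side rather than inside PARAMESH, I'd expect it to show up in one of those parameters, but I may be looking in the wrong place.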
Any ideas are very welcome, even if they just help me ask the right question of the OpenMPI users group or the NASA help desk.

Cordially,

Josh

-- 
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104