<div dir="ltr"><div>I should further note that NASA just emailed me and has indicated that they could not find anything wrong on the MPI or communications side when they looked at the log files (OpenMPI appears to be doing what it should be).<br><br></div>Josh<br><div><div dir="ltr"><div><br><div class="gmail_quote"><div dir="ltr">On Mon, Mar 7, 2016 at 3:33 PM Joshua Wall <<a href="mailto:joshua.e.wall@gmail.com" target="_blank">joshua.e.wall@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div>Dear users:<br><br></div> I've been doing some testing on Pleiades, and almost have everything working. I got the following error today while running a test on 150 processors, and am trying to make sure I know what the problem here is. It runs through a number of steps before I get:<br><br> *** Wrote plotfile to /nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010 <br>****
 WARNING: globalNumParticles = 0!!!
 *** Wrote particle file to /nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010 ****
 iteration, no. not moved = 0 7618
 iteration, no. not moved = 1 1138
 iteration, no. not moved = 2 0
 refined: total leaf blocks = 8891
 refined: total blocks = 10161

 Paramesh error : pe 18 needed full blk 65 17 but could not find it or only found part of it in the message buffer. Contact PARAMESH developers for help.
The last bit of the log file wasn't too helpful either:

 [ 03-07-2016 11:43:22.005 ] [gr_hgSolve]: gr_hgSolve: ite 0: norm(residual)/norm(src) = 5.794046E-04
 [ 03-07-2016 11:43:22.006 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
 [ 03-07-2016 11:43:22.081 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
 [ 03-07-2016 11:43:22.128 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
 [ 03-07-2016 11:43:22.274 ] [gr_hgSolve]: gr_hgSolve: ite 1: norm(residual)/norm(src) = 5.835671E-05
 [ 03-07-2016 11:43:22.275 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
 [ 03-07-2016 11:43:22.351 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
 [ 03-07-2016 11:43:22.394 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
 [ 03-07-2016 11:43:22.540 ] [gr_hgSolve]: gr_hgSolve: ite 2: norm(residual)/norm(src) = 2.329683E-06
 [ 03-07-2016 11:43:22.541 ] [mpi_amr_comm_setup]: buffer_dim_send=42989, buffer_dim_recv=33593
 [ 03-07-2016 11:43:22.618 ] [mpi_amr_comm_setup]: buffer_dim_send=36341, buffer_dim_recv=31453
 [ 03-07-2016 11:43:22.661 ] [mpi_amr_comm_setup]: buffer_dim_send=29693, buffer_dim_recv=29313
 [ 03-07-2016 11:43:22.805 ] [gr_hgSolve]: gr_hgSolve: ite 3: norm(residual)/norm(src) = 1.244917E-07
 [ 03-07-2016 11:43:23.283 ] [IO_writePlotfile] open: type=plotfile name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
 [ 03-07-2016 11:43:25.511 ] [IO_writePlotfile] close: type=plotfile name=/nobackupp8/jewall/turbsph/turbsph_hdf5_plt_cnt_0010
 [ 03-07-2016 11:43:25.514 ] [IO_writeParticles] open: type=particles name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
 [ 03-07-2016 11:43:25.514 ] [IO_writeParticles]: done called Particles_updateAttributes()
 [ 03-07-2016 11:43:25.529 ] [IO_writeParticles] close: type=particles name=/nobackupp8/jewall/turbsph/turbsph_hdf5_part_0010
 [ 03-07-2016 11:43:25.578 ] [mpi_amr_comm_setup]: buffer_dim_send=723197, buffer_dim_recv=704385
 [ 03-07-2016 11:43:25.907 ] [GRID amr_refine_derefine]: initiating refinement
 [ 03-07-2016 11:43:25.917 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 10161
 [GRID amr_refine_derefine] min blks 67 max blks 69 tot blks 10161
 [GRID amr_refine_derefine] min leaf blks 58 max leaf blks 60 tot leaf blks 8891
 [ 03-07-2016 11:43:25.929 ] [GRID amr_refine_derefine]: refinement complete

I'm guessing the issue is that the buffer wasn't able to pass a big enough message? Has anyone seen this error before on a large cluster? I'm wondering if there is somehow a pre-set buffer limit on NASA's fabric. I've run this same test code before on another cluster with the same number of processors (150), using an OpenMPI 1.10.2 build of my own, without any problem (which is why I picked it for this test). I've also been sorting out an issue where communication between nodes worked but communication between islands of nodes did not. I think I fixed that by adding the MCA parameter --mca oob_tcp_if_include ib0, but it seems worth mentioning.
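As a sanity check on the inter-island path, something like the following might work. This is only a sketch: the hosts.islands hostfile and the osu_bw binary (from the OSU micro-benchmarks) are assumptions, and any simple large-message ping-pong test would do. The idea is to run between two nodes that sit on different islands, pin the OOB and openib BTL to ib0, and turn up the BTL selection logging to see whether large buffers actually make it across:

    # two nodes on different islands listed in a (hypothetical) hostfile
    mpirun -np 2 --hostfile hosts.islands \
        --mca oob_tcp_if_include ib0 \
        --mca btl_openib_if_include ib0 \
        --mca btl_base_verbose 100 \
        ./osu_bw

By default osu_bw ramps the message size up into the MB range, so if there really is a buffer ceiling between islands it should show up as a hang or an openib error at the larger sizes.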
I'm using a "home built" OpenMPI 1.10.2, configured with NASA's recommended settings:

    ./configure --with-tm=/PBS --with-verbs=/usr --enable-mca-no-build=maffinity-libnuma \
        --with-cuda=/nasa/cuda/7.0 --enable-mpi-interface-warning \
        --without-slurm --without-loadleveler --enable-mpirun-prefix-by-default \
        --enable-btl-openib-failover --prefix=/u/jewall/ompi-1.10.2
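For reference, the build, install, and verification steps that follow are roughly the ones below. Again just a sketch, with nothing NASA-specific in it; the -j count is an arbitrary choice:

    make -j8 && make install
    # point the shell at this install so the matching mpirun and libraries get picked up
    export PATH=/u/jewall/ompi-1.10.2/bin:$PATH
    export LD_LIBRARY_PATH=/u/jewall/ompi-1.10.2/lib:$LD_LIBRARY_PATH
    # quick check that the build in use is really this one
    which mpirun
    ompi_info | grep "Open MPI:"

With --enable-mpirun-prefix-by-default the prefix should also be propagated to remote nodes automatically, so the exports mainly matter for the shell that launches the job.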
Any ideas are very welcome, even if they just help me ask the right question of the OpenMPI users group or the NASA help desk.

Cordially,

Josh
-- 
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104