[FLASH-USERS] MPI deadlock in block move after refinement

Jeremy S Ritter jritter at mail.utexas.edu
Sat Sep 2 12:07:45 EDT 2017

Hi Rukmani,

I was able to get past the initial hangup by reducing MAXBLOCKS. I normally
use 500, but reduced it to 400 and then 300 to get through initialization. The
problem occurs much less frequently, but it was not eliminated: it still
happens if I refine too much during one step. I have been running with 48
CPUs per node so that each has 2 GB of memory, like Stampede1. I had
been using Stampede2 without problems for a couple of months before this
started happening suddenly, so I suspect it coincides with some software
upgrade they did recently. I opened a ticket with TACC a few weeks ago, but
they haven't responded.
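
For anyone wanting to reason about why lowering MAXBLOCKS might help, here is a rough back-of-the-envelope sketch. Only the 48 ranks per node, the 2 GB per rank, and the MAXBLOCKS values 500/400/300 come from my runs; the 96 GB node size is simply 48 x 2 GB, and the variable count, block size, and guard-cell count are placeholder assumptions, as is the idea that the main solution array scales linearly with MAXBLOCKS.

```python
# Rough per-rank memory budget. nvar, nxb/nyb/nzb and nguard below are
# placeholder assumptions, not values taken from the actual simulation.

def memory_per_rank_gb(node_memory_gb, ranks_per_node):
    """Memory available to each MPI rank if the node is shared evenly."""
    return node_memory_gb / ranks_per_node


def block_storage_gb(maxblocks, nvar, nxb, nyb, nzb, nguard, bytes_per_value=8):
    """Size of one cell-centered array sized for maxblocks blocks (with guard cells)."""
    cells = (nxb + 2 * nguard) * (nyb + 2 * nguard) * (nzb + 2 * nguard)
    return maxblocks * nvar * cells * bytes_per_value / 1024**3


print(memory_per_rank_gb(96, 48))  # 2.0

for maxblocks in (500, 400, 300):
    size = block_storage_gb(maxblocks, nvar=20, nxb=8, nyb=8, nzb=8, nguard=4)
    print(maxblocks, round(size, 3))
```

This only counts a single array, so the real footprint per rank would be larger; it is meant to show the scaling with MAXBLOCKS, not an exact number.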


On Fri, Sep 1, 2017 at 2:23 PM, Rukmani Vijayaraghavan <rukmani at virginia.edu> wrote:

> Hi,
> I've been experiencing similar problems with FLASH4 on Stampede2 at TACC.
> My jobs hang on the initial block refinement step.
> I wrote to the people at TACC and they recommended running a test problem
> on 1 node with 16 cores on Stampede 1 and 1 node with 64 cores on Stampede 2.
> When I tried this for the basic 2D Sedov test problem, Stampede 2 was 3 times
> slower than Stampede 1. Based on their response, it looks like FLASH is
> currently not well optimized for the new KNL nodes of Stampede 2 (
> https://portal.tacc.utexas.edu/user-guides/stampede2#bestpractices). I'm
> not sure if anything else is missing -- I tested my simulation setup on
> Stampede 1 and it worked just fine.
> Are there any other recommendations or fixes? Has anybody else
> successfully run large FLASH runs on Stampede 2?
> Thanks,
> Rukmani
>> Hello,
>> I have been experiencing a problem with my simulations hanging at random
>> occasions while processing a refinement step (via gr_updateRefinement). The
>> issue seems to be related to a mismatch between the processors sending and
>> receiving block data in mpi_amr_redist_blk.F90. A stack trace shows that
>> there is a single processor still making a call to send_block_data() while
>> the rest have moved on to the subsequent MPI_ALLREDUCE() call (see excerpt
>> below). The deadlock condition is repeatable: e.g. if it happened at step 3
>> it will keep happening at the same point unless I change the grid structure
>> by refining more or less. I have some of my own routines that are modifying
>> the logical structures for marking blocks, as in refine(blockID) = .true.,
>> but am not attempting to modify the grid through any other means. Is it
>> possible there is a problem with my MPI setup? I am using FLASH4.4 on the
>> new Stampede2 at TACC with 8 nodes by 48 processors each for 384 total
>> processors.
>> Thanks!
>> -Jeremy
>> flash4  000000000079E2D6  send_block_data_   269  send_block_data.F90
>> flash4  000000000070C3E4  amr_redist_blk_    674  mpi_amr_redist_blk.F90
>> flash4  00000000004FE560  amr_morton_order_  164  amr_morton_order.F90
>> flash4  000000000071A7DA  amr_refine_derefi  319  mpi_amr_refine_derefine.F90
>> flash4  00000000005D8B22  gr_updaterefineme  112  gr_updateRefinement.F90
>> flash4  0000000000457A10  grid_updaterefine   98  Grid_updateRefinement.F90
>> flash4  0000000000413F74  driver_evolveflas  390  Driver_evolveFlash.F90
>> flash4  0000000000413F74  MAIN__              51  Flash.F90
> --
> Rukmani Vijayaraghavan
> NSF Astronomy & Astrophysics Postdoctoral Fellow
> University of Virginia
> rukmani at virginia.edu
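
The send/receive mismatch described in the original report can be illustrated with a small bookkeeping check. This is only a sketch of the general idea, not how mpi_amr_redist_blk.F90 works: the (source, destination, tag) tuples and the function name are hypothetical. A send that has no matching receive may never complete, which would leave one rank stuck in the send while the others wait in the next collective call, consistent with the trace above.

```python
from collections import Counter


def unmatched_transfers(sends, recvs):
    """Compare planned sends against planned receives.

    sends: iterable of (source_rank, dest_rank, block_tag) the senders will post
    recvs: iterable of (source_rank, dest_rank, block_tag) the receivers expect
    Returns (sends with no matching receive, receives with no matching send).
    """
    send_count = Counter(sends)
    recv_count = Counter(recvs)
    # Counter subtraction keeps only the entries with a positive surplus.
    orphan_sends = sorted((send_count - recv_count).elements())
    orphan_recvs = sorted((recv_count - send_count).elements())
    return orphan_sends, orphan_recvs


# Hypothetical plan: rank 2 believes it must send block 7 to rank 0,
# but rank 0 never posts the matching receive.
sends = [(0, 1, 3), (1, 2, 5), (2, 0, 7)]
recvs = [(0, 1, 3), (1, 2, 5)]

orphan_sends, orphan_recvs = unmatched_transfers(sends, recvs)
print(orphan_sends)  # [(2, 0, 7)]
print(orphan_recvs)  # []
```

In a real run the two lists would have to be gathered from every rank just before the block move, for example by logging them per rank and comparing offline.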
