[FLASH-USERS] MPI deadlock in block move after refinement

Rukmani Vijayaraghavan rukmani at virginia.edu
Fri Sep 1 15:23:06 EDT 2017


Hi,

I've been experiencing similar problems with FLASH4 on Stampede2 at 
TACC. My jobs hang on the initial block refinement step.

I wrote to the people at TACC and they recommended running a test 
problem on a single node (16 cores) on Stampede 1 and a single node 
(64 cores) on Stampede 2. When I tried this with the basic 2D Sedov 
test problem, Stampede 2 was 3 times slower than Stampede 1. Based on 
their response, it looks like FLASH is currently not well optimized 
for the new KNL nodes of Stampede 2 
(https://portal.tacc.utexas.edu/user-guides/stampede2#bestpractices). 
I'm not sure whether I'm missing anything else -- I tested my 
simulation setup on Stampede 1 and it worked just fine.

Are there any other recommendations or fixes? Has anybody else 
successfully completed large FLASH runs on Stampede 2?

Thanks,
Rukmani

> Hello,
>
> I have been experiencing a problem with my simulations hanging on random
> occasions while processing a refinement step (via gr_updateRefinement). The
> issue seems to be related to a mismatch between the processors sending and
> receiving block data in mpi_amr_redist_blk.F90. A stack trace shows that
> there is a single processor still making a call to send_block_data() while
> the rest have moved on to the subsequent MPI_ALLREDUCE() call (see excerpt
> below). The deadlock condition is repeatable: e.g. if it happened at step 3
> it will keep happening at the same point unless I change the grid structure
> by refining more or less. I have some of my own routines that are modifying
> the logical structures for marking blocks, as in refine(blockID) = .true.,
> but am not attempting to modify the grid through any other means. Is it
> possible there is a problem with my MPI setup? I am using FLASH4.4 on the
> new Stampede2 at TACC with 8 nodes by 48 processors each for 384 total
> processors.
>
> Thanks!
> -Jeremy
>
> flash4   000000000079E2D6  send_block_data_    269  send_block_data.F90
> flash4   000000000070C3E4  amr_redist_blk_     674  mpi_amr_redist_blk.F90
> flash4   00000000004FE560  amr_morton_order_   164  amr_morton_order.F90
> flash4   000000000071A7DA  amr_refine_derefi   319  mpi_amr_refine_derefine.F90
> flash4   00000000005D8B22  gr_updaterefineme   112  gr_updateRefinement.F90
> flash4   0000000000457A10  grid_updaterefine    98  Grid_updateRefinement.F90
> flash4   0000000000413F74  driver_evolveflas   390  Driver_evolveFlash.F90
> flash4   000000000041DBB3  MAIN__               51  Flash.F90
>
> flash4   000000000070C50F  amr_redist_blk_     686  mpi_amr_redist_blk.F90
> flash4   00000000004FE560  amr_morton_order_   164  amr_morton_order.F90
> flash4   000000000071A7DA  amr_refine_derefi   319  mpi_amr_refine_derefine.F90
> flash4   00000000005D8B22  gr_updaterefineme   112  gr_updateRefinement.F90
> flash4   0000000000457A10  grid_updaterefine    98  Grid_updateRefinement.F90
> flash4   0000000000413F74  driver_evolveflas   390  Driver_evolveFlash.F90
> flash4   000000000041DBB3  MAIN__               51  Flash.F90
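
For reference, a custom block-marking routine of the kind described in 
the quoted message typically looks something like the minimal sketch 
below. This is only an illustration: the subroutine name, the 
density-based criterion, and the threshold value are made up, and it 
assumes the standard PARAMESH tree module and the FLASH4 Grid API 
(Grid_getListOfBlocks, Grid_getBlkPtr, Grid_getBlkIndexLimits). The one 
property worth stressing is that each rank should set refine/derefine 
for its own blocks deterministically, from data it owns, since the 
subsequent refine/derefine pass builds the new block distribution from 
those flags before mpi_amr_redist_blk moves the data (as the trace 
above shows).

!! Minimal sketch of a customized block-marking routine; names,
!! criterion, and threshold are illustrative only.
subroutine my_markRefineDerefine()

  use tree,           ONLY : refine, derefine, lrefine, &
                             lrefine_max, lrefine_min
  use Grid_interface, ONLY : Grid_getListOfBlocks, Grid_getBlkPtr, &
                             Grid_releaseBlkPtr, Grid_getBlkIndexLimits
  implicit none

#include "constants.h"
#include "Flash.h"

  integer :: blkCount, lb, blockID
  integer, dimension(MAXBLOCKS)     :: blkList
  integer, dimension(2,MDIM)        :: blkLimits, blkLimitsGC
  real, pointer, dimension(:,:,:,:) :: solnData
  real, parameter :: densThreshold = 1.0e-26   ! illustrative value only

  call Grid_getListOfBlocks(LEAF, blkList, blkCount)

  do lb = 1, blkCount
     blockID = blkList(lb)

     ! Start from a clean state for this block.
     refine(blockID)   = .false.
     derefine(blockID) = .false.

     call Grid_getBlkPtr(blockID, solnData)
     call Grid_getBlkIndexLimits(blockID, blkLimits, blkLimitsGC)

     ! Decide purely from data this rank owns, deterministically:
     ! the same block contents must always give the same flags.
     if (maxval(solnData(DENS_VAR, &
           blkLimits(LOW,IAXIS):blkLimits(HIGH,IAXIS), &
           blkLimits(LOW,JAXIS):blkLimits(HIGH,JAXIS), &
           blkLimits(LOW,KAXIS):blkLimits(HIGH,KAXIS))) > densThreshold) then
        if (lrefine(blockID) < lrefine_max) refine(blockID) = .true.
     else
        if (lrefine(blockID) > lrefine_min) derefine(blockID) = .true.
     end if

     call Grid_releaseBlkPtr(blockID, solnData)
  end do

end subroutine my_markRefineDerefine

In practice this kind of logic usually lives in a simulation-directory 
copy of Grid_markRefineDerefine.F90, so that it replaces the default 
implementation at setup time.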

-- 
Rukmani Vijayaraghavan
NSF Astronomy & Astrophysics Postdoctoral Fellow
University of Virginia
rukmani at virginia.edu



