[FLASH-USERS] MPI deadlock in block move after refinement
Rukmani Vijayaraghavan
rukmani at virginia.edu
Fri Sep 1 15:23:06 EDT 2017
Hi,
I've been experiencing similar problems with FLASH4 on Stampede2 at
TACC. My jobs hang on the initial block refinement step.
I wrote to the people at TACC and they recommended running a test
problem on 1 node with 16 cores on Stampede 1 and on 1 node with 64 cores
on Stampede 2. When I tried this with the basic 2D Sedov test problem,
Stampede 2 was 3 times slower than Stampede 1. Based on their response, it
looks like FLASH is currently not well optimized for the new KNL nodes of
Stampede 2 (https://portal.tacc.utexas.edu/user-guides/stampede2#bestpractices).
I'm not sure whether anything else is missing -- I tested my simulation
setup on Stampede 1 and it worked just fine.
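In case it helps anyone trying to reproduce the comparison, the single-node,
64-task Sedov run on the Stampede 2 KNL nodes can be submitted with a job
script roughly like the sketch below (the job name and time limit are just
placeholders; "normal" is the KNL production queue named in the Stampede 2
user guide; adjust everything for your own allocation and build):

#!/bin/bash
#SBATCH -J sedov2d        # job name (placeholder)
#SBATCH -p normal         # Stampede 2 KNL production queue
#SBATCH -N 1              # one node
#SBATCH -n 64             # 64 MPI tasks on that node
#SBATCH -t 01:00:00       # wall-clock limit (placeholder)

# flash.par is read from the working directory; ibrun is TACC's MPI launcher
ibrun ./flash4

The same structure works for the 16-core Stampede 1 comparison run, with
-n 16 and that system's own queue names.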
Are there any other recommendations or fixes? Has anybody else
successfully completed large FLASH runs on Stampede 2?
Thanks,
Rukmani
> Hello,
>
> I have been experiencing a problem with my simulations hanging at random
> points while processing a refinement step (via gr_updateRefinement). The
> issue seems to be related to a mismatch between the processors sending and
> receiving block data in mpi_amr_redist_blk.F90. A stack trace shows that
> there is a single processor still making a call to send_block_data() while
> the rest have moved on to the subsequent MPI_ALLREDUCE() call (see excerpt
> below). The deadlock condition is repeatable: e.g. if it happened at step 3
> it will keep happening at the same point unless I change the grid structure
> by refining more or less. I have some of my own routines that are modifying
> the logical structures for marking blocks, as in refine(blockID) = .true.,
> but am not attempting to modify the grid through any other means. Is it
> possible there is a problem with my MPI setup? I am using FLASH4.4 on the
> new Stampede2 at TACC with 8 nodes of 48 processors each, for 384
> processors in total.
>
> Thanks!
> -Jeremy
>
> flash4  000000000079E2D6  send_block_data_   269  send_block_data.F90
> flash4  000000000070C3E4  amr_redist_blk_    674  mpi_amr_redist_blk.F90
> flash4  00000000004FE560  amr_morton_order_  164  amr_morton_order.F90
> flash4  000000000071A7DA  amr_refine_derefi  319  mpi_amr_refine_derefine.F90
> flash4  00000000005D8B22  gr_updaterefineme  112  gr_updateRefinement.F90
> flash4  0000000000457A10  grid_updaterefine   98  Grid_updateRefinement.F90
> flash4  0000000000413F74  driver_evolveflas  390  Driver_evolveFlash.F90
> flash4  000000000041DBB3  MAIN__              51  Flash.F90
>
> flash4  000000000070C50F  amr_redist_blk_    686  mpi_amr_redist_blk.F90
> flash4  00000000004FE560  amr_morton_order_  164  amr_morton_order.F90
> flash4  000000000071A7DA  amr_refine_derefi  319  mpi_amr_refine_derefine.F90
> flash4  00000000005D8B22  gr_updaterefineme  112  gr_updateRefinement.F90
> flash4  0000000000457A10  grid_updaterefine   98  Grid_updateRefinement.F90
> flash4  0000000000413F74  driver_evolveflas  390  Driver_evolveFlash.F90
> flash4  000000000041DBB3  MAIN__              51  Flash.F90
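
For what it's worth, the hang pattern in the quoted traces (one rank still
inside a blocking point-to-point send while every other rank has already
entered a collective) can be reproduced with a few lines of stand-alone MPI.
This is only an illustrative sketch, not FLASH or PARAMESH code; the buffer
size, message tag, and ranks are made up:

program deadlock_sketch
  ! Run with at least two ranks, e.g. mpiexec -n 4 ./deadlock_sketch.
  ! Rank 0 still "owes" a block of data to rank 1, but rank 1 (like every
  ! other rank) has already moved on to the collective and never posts the
  ! matching receive. Everybody hangs: rank 0 in the send, the rest in
  ! MPI_ALLREDUCE, the same picture as the two tracebacks above.
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, lval, gsum
  integer :: buf(1000)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  buf  = rank
  lval = rank

  if (rank == 0) then
     ! blocking synchronous send: cannot complete until rank 1 posts a receive
     call MPI_SSEND(buf, size(buf), MPI_INTEGER, 1, 42, MPI_COMM_WORLD, ierr)
  end if

  ! every rank except 0 reaches this immediately; the reduction can never
  ! finish because rank 0 is still stuck in the send above
  call MPI_ALLREDUCE(lval, gsum, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)

  print *, 'rank', rank, 'sum =', gsum   ! never reached
  call MPI_FINALIZE(ierr)
end program deadlock_sketch

Whether the mismatch in FLASH comes from the custom refine(blockID) marking
or from something in the MPI stack on the KNL nodes, the symptom would look
much like this.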
--
Rukmani Vijayaraghavan
NSF Astronomy & Astrophysics Postdoctoral Fellow
University of Virginia
rukmani at virginia.edu