[FLASH-USERS] mpi_amr_redist_blk has some processors hang at nrecv waitall

Mark Richardson mark.richardson.work at gmail.com
Sun Jun 17 19:47:42 EDT 2018


Hello,

  My current FLASH build worked fine on the original Stampede and on small local clusters, but on both KNL and SKX nodes on Stampede2 I get a hang during refinement in mpi_amr_redist_blk. If I build the initial simulation on a different cluster, the hang happens on Stampede2 the first time the grid structure changes. If I build the initial simulation on Stampede2, it hangs after triggering level 6 in the initial Simulation_initBlk loop, again in mpi_amr_redist_blk.

Setup call: 
  ./setup -auto -3d -nxb=32 -nyb=16 -nzb=8 -maxblocks=200 species=rock,watr +uhd3tr mgd_meshgroups=1 Simulation_Buiild

  Using: ifort (IFORT) 17.0.4 20170411 

The tail of the log file is in the attached Logfile.pdf.

I’ve changed maxblocks and the number of nodes, without resolving the issue.

I’ve changed the “iteration, no. not moved” output to occur for each processor, and they all print identical, correct info. I’ve added per-processor print statements before the nrecv>0 waitall and the nsend>0 waitall in mpi_amr_redist_blk.F90, and I see that about 25% of processors wait indefinitely in the nrecv>0 waitall, while the other 75% complete the redist_blk subroutine and then wait later for the remaining processors to finish.
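
For anyone trying to reproduce this diagnosis, here is a minimal sketch of how the per-processor print output could be tallied to list the ranks stuck in each waitall. It assumes each print statement emits a line such as "rank 3 before nrecv waitall" or "rank 3 after nrecv waitall"; that message format and the function name stuck_ranks are illustrative assumptions, not taken from the actual modified mpi_amr_redist_blk.F90.

```python
import re
from collections import Counter

# Hypothetical message format (an assumption, not FLASH/PARAMESH output):
#   "rank <N> before nrecv waitall" / "rank <N> after nrecv waitall"
LINE_RE = re.compile(r"rank\s+(\d+)\s+(before|after)\s+(nrecv|nsend)\s+waitall")


def stuck_ranks(lines):
    """Return {stage: sorted ranks that entered a waitall more often than they left it}."""
    pending = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # ignore unrelated output
        rank = int(match.group(1))
        when = match.group(2)
        stage = match.group(3)
        # +1 on entry, -1 on exit; a positive balance means the rank never returned.
        pending[(stage, rank)] += 1 if when == "before" else -1

    stuck = {}
    for (stage, rank), balance in pending.items():
        if balance > 0:
            stuck.setdefault(stage, []).append(rank)
    return {stage: sorted(ranks) for stage, ranks in stuck.items()}


# Small made-up example: rank 1 enters the nrecv waitall and never leaves it.
sample = [
    "rank 0 before nrecv waitall",
    "rank 1 before nrecv waitall",
    "rank 0 after nrecv waitall",
    "rank 0 before nsend waitall",
    "rank 0 after nsend waitall",
]
print(stuck_ranks(sample))  # prints {'nrecv': [1]}
```

Counting entries and exits, rather than only recording whether a rank was ever seen, keeps the tally correct when the redistribution routine is called more than once in a run.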

I’ve tried adding sleep(1) inside the niter loop, as was suggested in the past to someone who found niter going to 100. (Note that I’m getting niter = 2 with no. not moved = 0, so all processors successfully exit that loop but hang later.) This didn’t change the result.

Has anyone else seen a similar hang, on any cluster? Any suggestions for overcoming it?

Thank you for your help,
  -Mark

 





-- 

Mark Richardson
MAT Postdoctoral Fellow
Department of Astrophysics
American Museum of Natural History
MRichardson at amnh.org
My Website <https://sites.google.com/site/marklarichardson/>
212 496 3432

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Logfile.pdf
Type: application/pdf
Size: 18752 bytes
Desc: not available
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180617/8b0678dd/attachment.pdf>