[FLASH-USERS] mpi_amr_redist_blk has some processors hang at nrecv waitall

Sean M. Couch couch at pa.msu.edu
Thu Jun 21 10:45:55 EDT 2018


Hi Mark,

You might also try adding `useFortran2003=True` to your setup line. On Intel compilers, at least, this will spit out some useful memory usage statistics toward the beginning of the .log file.

Best of luck,
Sean


---------------------------------------------------------------------------------------------

Sean M. Couch

Assistant Professor

Department of Physics and Astronomy

Department of Computational Mathematics, Science, and Engineering

National Superconducting Cyclotron Laboratory/Facility for Rare Isotope Beams

Michigan State University

567 Wilson Rd, 3250 BPS

East Lansing, MI 48824

(517) 884-5035 | couch at pa.msu.edu | www.pa.msu.edu/~couch

On Jun 21, 2018, 10:37 AM -0400, Mark Richardson <mark.richardson.work at gmail.com>, wrote:
Hi Josh,
  Thanks for your suggestions. I have turned on debugging with -O0 and -g, and it didn't affect the outcome. All gcc compilers on Stampede2 are newer than version 5, so I might try to install an earlier version. Further, their mpif90 etc. is built with Intel, while my builds that work elsewhere are all GNU-built.

I have maxblocks set to 200, but I am only trying to allocate an average of 63 blocks per processor. I have played with the number of processors per node and the number of nodes, effectively changing both the average number of blocks allocated per processor and the memory available to each processor. Neither averted the hang.
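
A rough way to sanity-check the out-of-RAM possibility is to estimate how much memory the main block-data array needs per MPI rank at a given maxblocks. The sketch below is a lower bound under stated assumptions, not FLASH's own accounting: it counts a single double-precision array of shape (nvar, nxb+2*nguard, nyb+2*nguard, nzb+2*nguard, maxblocks). The guard-cell width (4) and the variable count (20) are illustrative guesses, the function name is hypothetical, and the block dimensions are taken from the setup line quoted later in this thread.

```python
def block_array_bytes(nxb, nyb, nzb, nguard, nvar, maxblocks, bytes_per_value=8):
    """Bytes for one cell-centered array covering every block slot on one rank.

    Memory is reserved for all `maxblocks` slots, so the estimate depends on
    maxblocks rather than on the number of blocks actually in use.
    """
    cells_per_block = (nxb + 2 * nguard) * (nyb + 2 * nguard) * (nzb + 2 * nguard)
    return cells_per_block * nvar * bytes_per_value * maxblocks


if __name__ == "__main__":
    # nxb, nyb, nzb and maxblocks are from the setup line in this thread;
    # nguard=4 and nvar=20 are assumptions made only for illustration.
    total = block_array_bytes(nxb=32, nyb=16, nzb=8, nguard=4, nvar=20, maxblocks=200)
    print(f"{total / 2**30:.2f} GiB per rank (lower bound, one array only)")
```

Multiplying the result by the number of ranks per node and comparing it with the node's memory gives a first indication of whether a given layout is plausible; scratch, flux and communication buffers add to this figure.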

Thanks again,
  -Mark


On 18 June 2018 at 07:59, Joshua Wall <joshua.e.wall at gmail.com> wrote:
Hello Mark,

    I seem to remember some issue with the code hanging that was due to optimization with newer versions of the GCC compiler suite. Indeed, I think I also implemented this to get past a hang:

http://flash.uchicago.edu/pipermail/flash-users/2015-February/001637.html

Does attempting this fix (or running with -O0) help at all?

As a final note, I have also seen the code hang silently when in fact I had either 1) exceeded the maximum number of blocks per processor or 2) run out of RAM on a node. So those are things to check as well.

Hope that helps!

Josh

On Mon, Jun 18, 2018 at 1:47 AM Mark Richardson <mark.richardson.work at gmail.com> wrote:
Hello,

  My current FLASH build worked fine on the original Stampede and on small local clusters. But on both KNL and SKX nodes on Stampede2, I get a hang during refinement in mpi_amr_redist_blk. If I build the initial simulation on a different cluster, then the hang happens on Stampede2 the first time the grid structure changes. If I build the initial simulation on Stampede2, it hangs after triggering level 6 in the initial Simulation_initBlk loop, again in mpi_amr_redist_blk.

Setup call:
  ./setup -auto -3d -nxb=32 -nyb=16 -nzb=8 -maxblocks=200 species=rock,watr +uhd3tr mgd_meshgroups=1 Simulation_Buiild

  Using: ifort (IFORT) 17.0.4 20170411

Log file tail in file Logfile.pdf

I’ve changed maxblocks and the number of nodes, without resolving the issue.

I’ve changed the “iteration, no. not moved” output to occur for each processor, and they all print identical, correct information. I’ve added per-processor print statements before the nrecv>0 waitall and the nsend>0 waitall in mpi_amr_redist_blk.F90 and see that about 25% of processors wait indefinitely in the nrecv>0 waitall, while the other 75% complete the redist_blk subroutine and then wait later for the remaining processors to finish.
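
For anyone repeating this diagnosis, the per-processor prints can be post-processed to list exactly which ranks entered a waitall and never left it. The following is a minimal sketch, assuming each rank prints a marker line both before and after the call; the marker text (`rank N before nrecv waitall` / `rank N after nrecv waitall`) and the function name are hypothetical, not output that FLASH or PARAMESH produces.

```python
import re
from collections import Counter

# Hypothetical marker format written by hand-added print statements.
MARKER = re.compile(r"rank\s+(\d+)\s+(before|after)\s+(nrecv|nsend)\s+waitall")


def stuck_ranks(lines, phase="nrecv"):
    """Return sorted ranks that logged 'before' more often than 'after'."""
    pending = Counter()
    for line in lines:
        match = MARKER.search(line)
        if match and match.group(3) == phase:
            rank = int(match.group(1))
            pending[rank] += 1 if match.group(2) == "before" else -1
    return sorted(rank for rank, count in pending.items() if count > 0)


if __name__ == "__main__":
    sample = [
        "rank 0 before nrecv waitall",
        "rank 0 after nrecv waitall",
        "rank 1 before nrecv waitall",
    ]
    print(stuck_ranks(sample))
```

Knowing the stuck ranks makes it possible to check whether they share a node or a neighbor, which can help separate a node-level problem from a mismatch between posted sends and receives.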

I’ve tried adding sleep(1) inside the niter loop, as suggested in the past for someone who found niter going to 100 (note that I’m getting niter = 2 with no. not moved = 0, so all processors successfully exit that loop but hang later). This didn’t change the result.

Has anyone else seen similar hanging occurring, on any cluster? Any suggestions for overcoming this hang event?

Thank you for your help,
  -Mark







--

Mark Richardson
MAT Postdoctoral Fellow
Department of Astrophysics
American Museum of Natural History
MRichardson at amnh.org
My Website<https://sites.google.com/site/marklarichardson/>
212 496 3432

--
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104


