[FLASH-USERS] Bugs in the GridParticles Module

Yi-Hao Chen ychen at astro.wisc.edu
Wed May 8 16:41:33 EDT 2019


Dear FLASH Developers and Users,

I have encountered an MPI hang after the redistribution of blocks. In this case, the last line of the output is "refined: total blocks = XXXXX". After some investigation, I have found two potential bugs in the code that passes particles between processors. I document the details below in the hope that they help others who run into similar problems.

1. Hanging when the number of particles exceeds the limit
Relevant file:
Grid/GridParticles/gr_ptHandleExcess.F90

This subroutine is called in gr_ptLocalMatch, which is in turn called in Grid_moveParticles or gr_ptMoveSieve. If the option to remove excess particles is disabled (the default, gr_ptRemove = .false.), the subroutine is supposed to write a checkpoint file and abort the simulation. However, if this happens during gr_ptMoveSieve, only the processor whose particle count exceeds the limit calls IO_writeCheckpoint. The other processors continue the while loop and wait for MPI_ALLREDUCE in gr_ptNextProcPair or MPI_SENDRECV in gr_ptMoveSieve to complete, so the whole simulation hangs.
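To make the mismatch concrete, here is a minimal toy model in Python (not FLASH code). It assumes that IO_writeCheckpoint needs every processor to take part, and it reduces each processor to the name of the call it ends up waiting in. The function names and the particle counts are hypothetical and only serve the illustration.

```python
def blocked_calls(particle_counts, max_per_proc):
    """For each rank, name the call it ends up waiting in (toy model)."""
    calls = []
    for count in particle_counts:
        if count > max_per_proc:
            # Only the overflowing rank takes the checkpoint-and-abort path.
            calls.append("IO_writeCheckpoint")
        else:
            # Every other rank stays in the sieve loop.
            calls.append("MPI_ALLREDUCE")
    return calls


def would_hang(particle_counts, max_per_proc):
    """Assumption: a call completes only if all ranks enter the same one."""
    return len(set(blocked_calls(particle_counts, max_per_proc))) > 1


# Hypothetical run: rank 2 exceeds the limit, the others do not.
print(blocked_calls([10, 10, 120, 10], max_per_proc=100))
print(would_hang([10, 10, 120, 10], max_per_proc=100))
```

In this model the ranks only make progress when they all take the same path, which is why a single processor over the limit is enough to stall the run.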


2. A (minor) bug that might cause unnecessary communication
Relevant files:
Grid/GridParticles/GridParticlesMove/Sieve/gr_ptMoveSieve.F90
Grid/GridParticles/GridParticlesMove/Sieve/BlockMatch/gr_ptResetProcPair.F90
Grid/GridParticles/GridParticlesMove/Sieve/BlockMatch/gr_ptNextProcPair.F90

This seems to be a minor bug that will likely cause additional communication between processors. Since the particles module does not account for a large portion of the overall run time, it probably affects performance only slightly.
The bug concerns the use of gr_ptSieveCheckFreq and gr_ptSieveFreq. The former is an input parameter but also serves as a counter in gr_ptNextProcPair. In gr_ptMoveSieve, gr_ptSieveFreq is set to gr_ptSieveCheckFreq. Then gr_ptSieveCheckFreq is set to gr_ptSieveFreq+1 in gr_ptResetProcPair and decreases by 1 on every call to gr_ptNextProcPair. However, if communication is not needed, gr_ptNextProcPair is not called, so gr_ptSieveCheckFreq is never counted back down and keeps increasing over later timesteps of the simulation. See the following code excerpts for details. The result is that, later in the run, all processors keep communicating until timesInLoop==gr_meshNumProcs. I guess this is not the intended behavior.

gr_ptSieveCheckFreq = 1 (default)

In gr_ptMoveSieve:
  gr_ptSieveFreq=gr_ptSieveCheckFreq
  call gr_ptResetProcPair
  do while (mustCommunicate)
    call gr_ptNextProcPair

In gr_ptResetProcPair:
  gr_ptSieveCheckFreq=gr_ptSieveFreq+1

In gr_ptNextProcPair:
  gr_ptSieveCheckFreq=gr_ptSieveCheckFreq-1
  if((gr_ptSieveCheckFreq==0).or.(timesInLoop==gr_meshNumProcs)) then
    gr_ptSieveCheckFreq=gr_ptSieveFreq
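
The drift of the counter can be reproduced with a small toy model in Python (not FLASH code). It only mirrors the assignments in the excerpts above. Two things are my assumptions: that timesInLoop increases by one per call to gr_ptNextProcPair, and the pair_calls argument, a hypothetical stand-in for how many times the while loop body runs in one call to gr_ptMoveSieve.

```python
class SieveState:
    def __init__(self, check_freq=1):
        self.check_freq = check_freq  # models gr_ptSieveCheckFreq (default 1)
        self.sieve_freq = None        # models gr_ptSieveFreq


def move_sieve(state, num_procs, pair_calls):
    """Toy model of one gr_ptMoveSieve call.

    Returns the number of gr_ptNextProcPair calls made before the check
    condition first fires, or None if it never fires.
    """
    state.sieve_freq = state.check_freq      # in gr_ptMoveSieve
    state.check_freq = state.sieve_freq + 1  # in gr_ptResetProcPair
    times_in_loop = 0
    first_check = None
    for _ in range(pair_calls):              # do while (mustCommunicate)
        times_in_loop += 1                   # assumed: one increment per call
        state.check_freq -= 1                # in gr_ptNextProcPair
        if state.check_freq == 0 or times_in_loop == num_procs:
            state.check_freq = state.sieve_freq
            if first_check is None:
                first_check = times_in_loop
    return first_check


# Ten calls in which no communication is needed: the counter only grows.
state = SieveState(check_freq=1)
for _ in range(10):
    move_sieve(state, num_procs=8, pair_calls=0)
print(state.check_freq)

# The next call that does communicate on 8 processors.
print(move_sieve(state, num_procs=8, pair_calls=8))
```

In this model, a fresh state reaches the check after two calls to gr_ptNextProcPair, while the drifted state reaches it only when timesInLoop equals the number of processors. The inflated value also persists afterwards, because the counter is reset to the already inflated gr_ptSieveFreq.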

I would appreciate any comments you have. Hopefully these will be addressed by the FLASH team. Thank you very much for reading this long email.

Sincerely,
Yi-Hao


