[FLASH-USERS] mpi_amr_redist_blk has some processors hang at nrecv waitall

Yingchao Lu yingchao.lu at gmail.com
Thu Jun 21 10:54:10 EDT 2018


Hi Mark,

I had the same issue, and a TACC staff member helped me out. Try loading the
modules using the following script.
```
export LMOD_EXPERT=1
module purge
module load intel/17.0.4
module load mvapich2
module use /opt/apps/intel17/impi17_0/modulefiles
module load petsc/3.7
module use /opt/apps/intel17/impi17_0/modulefiles
module load phdf5
```
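
For example, you can save those commands to a file and source it before
rebuilding FLASH, so the new MPI and HDF5 get picked up (the file name below is
just an example):
```
source stampede2_env.sh   # the module commands above, saved to a file
cd object                 # the build directory created by ./setup
make clean && make        # full rebuild against the newly loaded MPI and phdf5
```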

Hope this helps!

Yingchao


On Thu, Jun 21, 2018 at 8:37 AM Mark Richardson <
mark.richardson.work at gmail.com> wrote:

> Hi Josh,
>   Thanks for your suggestions. I have turned on debug with -O0 and -g, and
> it didn't affect the outcome. All of the GCC compilers on Stampede2 are newer
> than version 5, so I might try installing an earlier one. Also, their mpif90
> etc. is built with Intel, while my builds that work elsewhere are all GNU-built.
>
> I have maxblocks set to 200, but I am only trying to allocate an average
> of 63 blocks per processor. I have played with the number of processors per
> node and the number of nodes, effectively changing both the average number of
> blocks allocated per processor and the memory available to each processor.
> Neither averted the hang.
>
> Thanks again,
>   -Mark
>
>
> On 18 June 2018 at 07:59, Joshua Wall <joshua.e.wall at gmail.com> wrote:
>
>> Hello Mark,
>>
>>     I seem to remember some issue with the code hanging that was due to
>> optimization with newer versions of the GCC compiler suite. Indeed, I think
>> I also implemented this to get past a hang:
>>
>> http://flash.uchicago.edu/pipermail/flash-users/2015-February/001637.html
>>
>>
>> Does attempting this fix (or running with -O0) help at all?
>>
>> As a final note, I have also seen the code hang silently when in fact I
>> had either 1) exceeded the maximum number of blocks per processor or 2) run
>> out of RAM on a node. So those are things to check as well.
>>
>> Hope that helps!
>>
>> Josh
>>
>> On Mon, Jun 18, 2018 at 1:47 AM Mark Richardson <
>> mark.richardson.work at gmail.com> wrote:
>>
>>> Hello,
>>>
>>>   My current FLASH build worked fine on the original Stampede, and on
>>> small local clusters. But on both KNL and SKX nodes on Stampede2, I get a
>>> hang during refinement in mpi_amr_redist_blk. If I build the initial
>>> simulation on a different cluster, then the hang happens on Stampede2 the
>>> first time the grid structure changes. If I build the initial simulation on
>>> Stampede2, it hangs after triggering level 6 in that initial
>>> Simulation_initBlock loop, but still in mpi_amr_redist_blk.
>>>
>>> Setup call:
>>>   ./setup -auto -3d -nxb=32 -nyb=16 -nzb=8 -maxblocks=200
>>> species=rock,watr +uhd3tr mgd_meshgroups=1 Simulation_Buiild
>>>
>>>   Using: ifort (IFORT) 17.0.4 20170411
>>>
>>> Log file tail in file Logfile.pdf
>>>
>>> I’ve changed maxblocks and the number of nodes without getting past this
>>> issue.
>>>
>>> I’ve changed the “iteration, no. not moved” output to occur for each
>>> processor, and they all print out the identical correct info. I’ve added
>>> per processor print statements before the nrecv>0 waitall and nsend>0
>>> waitall in mpi_amr_redist_blk.F90 and see that about 25% of processors are
>>> waiting indefinitely in the nrecv>0 waitall, while the other 75% complete
>>> the redist_blk subroutine and are waiting later for the remaining
>>> processors to finish.
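>>>
>>> For concreteness, the prints were along these lines (the request and status
>>> array names here are approximate, not the exact ones in the paramesh source):
>>>
>>>   if (nrecv > 0) then
>>>     write(*,*) 'rank ', mype, ' before nrecv waitall, nrecv = ', nrecv
>>>     call MPI_WAITALL(nrecv, recvrequest, recvstatus, ierr)
>>>     write(*,*) 'rank ', mype, ' after nrecv waitall'
>>>   end if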
>>>
>>> I’ve tried adding sleep(1) inside the niter loop, as suggested in the
>>> past for someone who found niter going to 100 (note: I’m getting niter = 2
>>> with no. not moved = 0, so all processors successfully exit that loop but hang
>>> later). This didn’t change the result.
>>>
>>> Has anyone else seen a similar hang on any cluster? Any
>>> suggestions for overcoming it?
>>>
>>> Thank you for your help,
>>>   -Mark
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Mark Richardson
>>> MAT Postdoctoral Fellow
>>> Department of Astrophysics
>>> American Museum of Natural History
>>> MRichardson at amnh.org
>>> My Website <https://sites.google.com/site/marklarichardson/>
>>> 212 496 3432
>>>
>> --
>> Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> Philadelphia, PA 19104
>>
>
>
>
> --
>
> Mark Richardson
> MAT Postdoctoral Fellow
> Department of Astrophysics
> American Museum of Natural History
> Mark.Richardson.Work at gmail.com
> My Website <https://sites.google.com/site/marklarichardson/>
> 212 496 3432
>