[FLASH-USERS] mpi_amr_redist_blk has some processors hang at nrecv waitall

Mark Richardson mark.richardson.work at gmail.com
Thu Jun 21 10:37:14 EDT 2018


Hi Josh,
  Thanks for your suggestions. I have turned on debugging with -O0 and -g, and
it didn't affect the outcome. All gcc compilers on Stampede2 are version >
5, so I might try to install an earlier version. Further, their mpif90 etc.
is built with Intel, while my builds that work elsewhere are all GNU-built.

I have maxblocks set to 200, but I am only trying to allocate an average of
63 blocks per processor. I have played with the number of processors per
node and the number of nodes, effectively changing both the average number
of blocks allocated per processor and the memory available to each
processor. Neither averted the hang.
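
For reference, the average quoted above is just the global block count divided
by the number of MPI ranks. A minimal sketch of that check (the block and rank
counts below are placeholder values, not the numbers from this run):

  ! back-of-the-envelope check of the average block count against maxblocks
  ! (placeholder numbers, not taken from the actual run)
  program block_budget
    implicit none
    integer, parameter :: maxblocks    = 200     ! from the setup line
    integer, parameter :: total_blocks = 32000   ! placeholder global block count
    integer, parameter :: nprocs       = 512     ! placeholder MPI rank count

    print '(a,f7.1,a,i4)', 'average blocks per rank: ', &
         real(total_blocks) / real(nprocs), '   maxblocks: ', maxblocks
  end program block_budget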

Thanks again,
  -Mark


On 18 June 2018 at 07:59, Joshua Wall <joshua.e.wall at gmail.com> wrote:

> Hello Mark,
>
>     I seem to remember some issue with the code hanging that was due to
> optimization with newer versions of the GCC compiler suite. Indeed, I think
> I also implemented this to get past a hang:
>
> http://flash.uchicago.edu/pipermail/flash-users/2015-February/001637.html
>
> Does attempting this fix (or running with -O0 ) help at all?
>
> As a final note, I have also seen the code hang silently when in fact I
> had either 1) exceeded the maximum number of blocks per processor or 2) run
> out of RAM on a node. So those are things to check as well.
>
> Hope that helps!
>
> Josh
>
> On Mon, Jun 18, 2018 at 1:47 AM Mark Richardson <
> mark.richardson.work at gmail.com> wrote:
>
>> Hello,
>>
>>   My current FLASH build worked fine on the original Stampede and on
>> small local clusters. But on both KNL and SKX nodes on Stampede2, I get a
>> hang during refinement in mpi_amr_redist_blk. If I build the initial
>> simulation on a different cluster, then the hang happens on Stampede2 the
>> first time the grid structure changes. If I build the initial simulation on
>> Stampede2, it hangs after triggering level 6 in that initial
>> Simulation_initBlk loop, but still in mpi_amr_redist_blk.
>>
>> Setup call:
>>   ./setup -auto -3d -nxb=32 -nyb=16 -nzb=8 -maxblocks=200
>> species=rock,watr +uhd3tr mgd_meshgroups=1 Simulation_Buiild
>>
>>   Using: ifort (IFORT) 17.0.4 20170411
>>
>> Log file tail in file Logfile.pdf
>>
>> I’ve changed maxblocks and the number of nodes without getting past this
>> issue.
>>
>> I’ve changed the “iteration, no. not moved” output to occur for each
>> processor, and they all print out identical, correct info. I’ve added
>> per-processor print statements before the nrecv>0 waitall and the nsend>0
>> waitall in mpi_amr_redist_blk.F90 and see that about 25% of processors
>> wait indefinitely in the nrecv>0 waitall, while the other 75% complete the
>> mpi_amr_redist_blk subroutine and wait later for the remaining processors
>> to finish.
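
A minimal sketch of that diagnostic, wrapped around the receive-side waitall
in mpi_amr_redist_blk.F90 (nrecv is the count mentioned above; the
request/status arrays reqr and statr and the rank variable mype are stand-ins
for whatever the subroutine actually declares):

   ! sketch: announce each rank entering/leaving the receive-side waitall
   if (nrecv > 0) then
      write(*,'(a,i6,a,i8)') 'rank ', mype, &
           ' entering nrecv waitall, nrecv = ', nrecv
      flush(6)                             ! push the message out before any hang
      call MPI_Waitall(nrecv, reqr, statr, ierr)
      write(*,'(a,i6,a)') 'rank ', mype, ' passed nrecv waitall'
      flush(6)
   end if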
>>
>> I’ve tried adding sleep(1) inside the niter loop, as suggested in the
>> past for someone who found niter going to 100 (note: I’m getting niter = 2
>> with no. not moved = 0, so all processors successfully exit that loop but
>> hang later). This didn’t change the result.
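
For completeness, that change is schematically along these lines (the loop
shown is only a stand-in for the actual niter loop in mpi_amr_redist_blk.F90;
note that call sleep() is a non-standard extension: a gfortran intrinsic, and
available via the IFPORT module with ifort):

   ! sketch: pause briefly on every pass of the block-moving iteration
   do while (repeat)                 ! stand-in for the actual niter loop test
      ! ... existing block-moving work ...
      call sleep(1)                  ! give lagging ranks a second to catch up
      niter = niter + 1
   end do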
>>
>> Has anyone else seen similar hangs on any cluster? Any suggestions for
>> getting past this hang?
>>
>> Thank you for your help,
>>   -Mark
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Mark Richardson
>> MAT Postdoctoral Fellow
>> Department of Astrophysics
>> American Museum of Natural History
>> MRichardson at amnh.org
>> My Website <https://sites.google.com/site/marklarichardson/>
>> 212 496 3432
>>
>> --
> Joshua Wall
> Doctoral Candidate
> Department of Physics
> Drexel University
> 3141 Chestnut Street
> Philadelphia, PA 19104
>



-- 

Mark Richardson
MAT Postdoctoral Fellow
Department of Astrophysics
American Museum of Natural History
Mark.Richardson.Work at gmail.com
My Website <https://sites.google.com/site/marklarichardson/>
212 496 3432

