<div dir="ltr"><div>Hello Tomek,</div><div><br></div><div>Ah, I see. Is this flag available in FLASH 4.2.2, and if so, how would I set it?</div><div><br></div><div>I'm also pushing this back to the flash-users list so that others can search for it in the future.</div><div><br></div><div>Josh<br></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Sep 12, 2018 at 1:28 PM Tomasz Plewa <<a href="mailto:tplewa@fsu.edu">tplewa@fsu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Josh -<br>
<br>
Your problem sounded familiar. The issue is that a multigrid-based <br>
solution is obtained in a composite way with coarsening of mesh <br>
structure. This in turn implies participation of progressively fewer <br>
processes as the V-cycle sweeps from fine to coarse. So eventually a <br>
pool of processes not performing relaxations emerges and grows (it may <br>
also be non-empty from the very beginning).<br>
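<br>
To illustrate the effect with made-up numbers, here is a minimal sketch that counts how many processes own at least one block on each refinement level. The block-to-process layout and the function name are hypothetical and are not taken from FLASH or PARAMESH.<br>

```python
def active_ranks_per_level(owner_by_level):
    """Map each refinement level to the number of distinct ranks owning a block on it."""
    return {level: len(set(owners)) for level, owners in owner_by_level.items()}


# Hypothetical layout on 4 ranks: level 1 is coarsest, level 3 is finest.
# Each list holds the owning rank of every block on that level.
owner_by_level = {
    3: [0, 0, 1, 1, 2, 2, 3, 3],
    2: [0, 1],
    1: [0],
}

counts = active_ranks_per_level(owner_by_level)
for level in sorted(counts, reverse=True):
    print(f"level {level}: {counts[level]} of 4 ranks participate")
```

Ranks that sit out a level may also skip timer calls made inside the per-level work, which is one way their timer lists can drift apart from the others.<br>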
<br>
We "solved" this issue by passing a flag to timers that disables/enables <br>
timing of such problematic code sections.<br>
<br>
Tomek<br>
--<br>
On 09/12/18 13:08, Joshua Wall wrote:<br>
> Hello Klaus and Ryan,<br>
><br>
> Thanks to a very helpful suggestion from Tomek, I modified my bit of debugging code.<br>
> Now each processor writes out a statement to its own file whenever a new timer is started (the timer index increases).<br>
> These files can then be diff'ed to see if any processor is creating a new timer that the others don't have.<br>
> I'll attach my new versions of the files to this email. (I also added notes for the debugging text, in case you'd like to merge this into FLASH at any point.)<br>
><br>
> Using this, I've tracked my problem down to the fft solver timer (which I promise I've made NO edits to!). Here's my vimdiff of the root timer_debug0000.txt and timer_debug0119.txt:<br>
><br>
> 218 [Timers_start]: Trying to start new timer fft | 218 [Timers_start]: Trying to start new timer fft<br>
> 219 [Timers_stop]: Trying to stop new timer fft | 219 [Timers_stop]: Trying to stop new timer fft<br>
> 220 [Timers_start]: Trying to start new timer fft | 220 [Timers_start]: Trying to start new timer fft<br>
> 221 [Timers_stop]: Trying to stop new timer fft | 221 [Timers_stop]: Trying to stop new timer fft<br>
> 222 [Timers_start]: Trying to start new timer fft | 222 [Timers_start]: Trying to start new timer fft<br>
> 223 [Timers_stop]: Trying to stop new timer fft | 223 [Timers_stop]: Trying to stop new timer fft<br>
> ---------------------------------------------- | 224 [Timers_start]: Trying to start new timer fft<br>
> ---------------------------------------------- | 225 [Timers_stop]: Trying to stop new timer fft<br>
> ---------------------------------------------- | 226 [Timers_start]: Trying to start new timer fft<br>
> ---------------------------------------------- | 227 [Timers_stop]: Trying to stop new timer fft<br>
> 224 [Timers_start]: Trying to start new timer gr_hgBndry | 228 [Timers_start]: Trying to start new timer gr_hgBndry<br>
> 225 [Timers_start]: Trying to start new timer work copy | 229 [Timers_start]: Trying to start new timer work copy<br>
> 226 [Timers_stop]: Trying to stop new timer work copy | 230 [Timers_stop]: Trying to stop new timer work copy<br>
> 227 [Timers_start]: Trying to start new timer gr_hgGuardCell | 231 [Timers_start]: Trying to start new timer gr_hgGuardCell<br>
> 228 [Timers_stop]: Trying to stop new timer gr_hgGuardCell | 232 [Timers_stop]: Trying to stop new timer gr_hgGuardCell<br>
> 229 [Timers_start]: Trying to start new timer work copy | 233 [Timers_start]: Trying to start new timer work copy<br>
> + 230 +--309 lines: [Timers_stop]: Trying to stop new timer work copy--- |+ 234 +--309 lines: [Timers_stop]: Trying to stop new timer work copy---<br>
> 539 [Timers_start]: Trying to start new timer fft | 543 [Timers_start]: Trying to start new timer fft<br>
> 540 [Timers_stop]: Trying to stop new timer fft | 544 [Timers_stop]: Trying to stop new timer fft<br>
> 541 [Timers_start]: Trying to start new timer fft | 545 [Timers_start]: Trying to start new timer fft<br>
> 542 [Timers_stop]: Trying to stop new timer fft | 546 [Timers_stop]: Trying to stop new timer fft<br>
> 543 [Timers_start]: Trying to start new timer fft | 547 [Timers_start]: Trying to start new timer fft<br>
> 544 [Timers_stop]: Trying to stop new timer fft | 548 [Timers_stop]: Trying to stop new timer fft<br>
> ---------------------------------------------- | 549 [Timers_start]: Trying to start new timer fft<br>
> ---------------------------------------------- | 550 [Timers_stop]: Trying to stop new timer fft<br>
> ---------------------------------------------- | 551 [Timers_start]: Trying to start new timer fft<br>
> ---------------------------------------------- | 552 [Timers_stop]: Trying to stop new timer fft<br>
> 545 [Timers_start]: Trying to start new timer gr_hgBndry | 553 [Timers_start]: Trying to start new timer gr_hgBndry<br>
> 546 [Timers_start]: Trying to start new timer work copy | 554 [Timers_start]: Trying to start new timer work copy<br>
> 547 [Timers_stop]: Trying to stop new timer work copy | 555 [Timers_stop]: Trying to stop new timer work copy<br>
> 548 [Timers_start]: Trying to start new timer gr_hgGuardCell | 556 [Timers_start]: Trying to start new timer gr_hgGuardCell<br>
> 549 [Timers_stop]: Trying to stop new timer gr_hgGuardCell | 557 [Timers_stop]: Trying to stop new timer gr_hgGuardCell<br>
> 550 [Timers_start]: Trying to start new timer work copy | 558 [Timers_start]: Trying to start new timer work copy<br>
><br>
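The comparison in the listing above could also be automated. Below is a minimal sketch, assuming each per-processor debug file holds one timer message per line in the format shown; the function names and the two short sample lists are hypothetical and are not part of FLASH.<br>

```python
from itertools import zip_longest


def first_divergence(lines_a, lines_b):
    """Return (index, a_line, b_line) for the first differing entry, or None.

    The index is 1-based to match the line numbers vimdiff displays.
    If one log is shorter, its missing entry is reported as None.
    """
    for i, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            return i, a, b
    return None


def read_log(path):
    """Read one per-processor debug file, stripping trailing whitespace."""
    with open(path) as fh:
        return [line.rstrip() for line in fh]


# Hypothetical excerpt modeled on the listing above (root and processor 119).
root = [
    "[Timers_stop]: Trying to stop new timer fft",
    "[Timers_start]: Trying to start new timer gr_hgBndry",
]
other = [
    "[Timers_stop]: Trying to stop new timer fft",
    "[Timers_start]: Trying to start new timer fft",
]
print(first_divergence(root, other))
```

With real files, the two lists would come from read_log("timer_debug0000.txt") and read_log("timer_debug0119.txt"); looping over all ranks against the root would report the first processor whose timer sequence differs.<br>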
> And grep'ing for call Timers_start("fft") I see:<br>
><br>
> josh@iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn 'call Timers_start("fft")' *<br>
> gr_hgSolveLevel.F90:176: call Timers_start("fft")<br>
> gr_hgSolveLevel.F90:194: call Timers_start("fft") !trick to keep timers structure on different procs the same - KW<br>
><br>
> So I'm hoping KW remembers the trick? :)<br>
><br>
> Cordially,<br>
><br>
> Josh<br>
><br>
> On Wed, Sep 12, 2018 at 11:35 AM Klaus Weide <<a href="mailto:klaus@flash.uchicago.edu" target="_blank">klaus@flash.uchicago.edu</a>> wrote:<br>
><br>
> On Tue, 11 Sep 2018, Joshua Wall wrote:<br>
><br>
> > Ryan,<br>
> ><br>
> > Another clue I just found that probably supports Klaus's theory of some<br>
> > timer started that didn't get stopped is the following I found in my log<br>
> > file:<br>
> ><br>
> > [ 09-10-2018 23:31:07.824 ] [Timers_getSummary]: Not writing timer<br>
> > max/min/avg values because not all processors had same timers<br>
><br>
> This can happen if not all procs execute the same code (in particular, the same Timers calls),<br>
> for example if you are running on more procs than you have blocks, especially in the initial step of a simulation.<br>
> By itself this should be harmless and not cause the other problems reported.<br>
><br>
> > In case you find the same. To be clear on what I'm running with exactly:<br>
> ><br>
> > Hydro: USM<br>
> > Grav: MG<br>
> > +cube16<br>
> > AMR<br>
> > +pm4dev<br>
> > +supportPPMupwind<br>
> > maxblocks = 50<br>
> > -a -3d<br>
> > optimization: -O3<br>
> > MPI: OpenMPI 1.10.02<br>
> > compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)<br>
> ><br>
> ><br>
> > Klaus,<br>
> ><br>
> > Would it be possible to call Timers_getSummary() after each call to the<br>
> > individual units in Driver_evolveFlash in an attempt to find the unit<br>
> > responsible by "bisection"? Essentially the same log message as above<br>
> > should print right after some unit gets the timers on the different<br>
> > processors out of sync. Does that seem okay to try?<br>
><br>
><br>
> I don't know whether Timers_getSummary behaves well if it is called more<br>
> than once. Normally, as you know, it is called exactly once, at the end of<br>
> a run. You are free to experiment, of course!<br>
><br>
><br>
> The following may also be useful:<br>
><br>
> You should be able to completely disable the Timers code by using the following in your setup command:<br>
><br>
> --without-unit=monitors/Timers<br>
><br>
> That should allow you to test the remainder of your code without<br>
> interference from improperly nested Timers calls.<br>
><br>
> Klaus<br>
><br>
> -- <br>
> Joshua Wall<br>
> Doctoral Candidate<br>
> Department of Physics<br>
> Drexel University<br>
> 3141 Chestnut Street<br>
> Philadelphia, PA 19104<br>
<br>
</blockquote></div></div>