[FLASH-USERS] Timer runs out of space
Joshua Wall
joshua.e.wall at gmail.com
Wed Sep 12 13:42:37 EDT 2018
Hello Tomek,
Ah I see. Is this flag available in FLASH 4.2.2, and if so how would I set
it?
Also going to now push this back to the flash-users list so that others can
search for this in the future.
Josh
On Wed, Sep 12, 2018 at 1:28 PM Tomasz Plewa <tplewa at fsu.edu> wrote:
> Josh -
>
> Your problem sounded familiar. The issue is that a multigrid-based
> solution is obtained in a composite way with coarsening of mesh
> structure. This in turn implies participation of progressively fewer
> processes as the V-cycle sweeps from fine to coarse. So eventually a
> pool of processes not performing relaxations emerges and grows (it may
> also be non-empty from the very beginning).
>
> We "solved" this issue by passing a flag to timers that disables/enables
> timing of such problematic code sections.
>
> Tomek
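The thread does not show what Tomek's flag looks like, so the following is only a toy sketch of the idea he describes: timing is switched off around a section that only some ranks execute, so that no rank creates a timer the others lack. The `Timers` class, `solve_level`, and the `time_section` argument are hypothetical illustration names, not FLASH routines.

```python
import time


class Timers:
    """Toy timer registry; each new name gets the next index on first use."""

    def __init__(self):
        self.enabled = True   # hypothetical on/off flag for timing
        self.index = {}       # timer name -> index in order of creation
        self.elapsed = {}     # timer name -> accumulated seconds
        self._started = {}    # timer name -> start timestamp

    def start(self, name):
        if not self.enabled:
            return
        self.index.setdefault(name, len(self.index))
        self._started[name] = time.perf_counter()

    def stop(self, name):
        if not self.enabled or name not in self._started:
            return
        spent = time.perf_counter() - self._started.pop(name)
        self.elapsed[name] = self.elapsed.get(name, 0.0) + spent


def solve_level(timers, participates, time_section):
    """One rank's view of a coarse-level solve.

    Only ranks with participates=True do the relaxation, so only they
    would create the "relax" timer. Passing time_section=False disables
    timing for the section and restores the previous setting afterwards.
    """
    previous = timers.enabled
    timers.enabled = previous and time_section
    try:
        if participates:
            timers.start("relax")
            # ... the relaxation work would go here ...
            timers.stop("relax")
    finally:
        timers.enabled = previous
```

With `time_section=False`, a participating rank and an idle rank both end up with the same (empty) timer table; with `time_section=True`, only the participating rank gains a "relax" entry and the tables diverge.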
> --
> On 09/12/18 13:08, Joshua Wall wrote:
> > Hello Klaus and Ryan,
> >
> > Thanks to a very helpful suggestion from Tomek, I modified my bit of
> > debugging code. Now each processor will write out a statement whenever a
> > new timer is started (the timer index increases) to a separate file for
> > each processor. These files can then be diff'ed to see if any processor is
> > creating a new timer that the others don't have. I'll attach my new
> > versions of the files to this email. (I also added notes for the debugging
> > text just in case you'd like to merge this into FLASH at any point.)
> >
> > Using this, I've tracked my problem down to the fft solver timer
> > (which I promise I've made NO edits to!). Here's my vimdiff of the root
> > timer_debug0000.txt and timer_debug0119.txt:
> >
> > 218 [Timers_start]: Trying to start new timer fft | 218 [Timers_start]: Trying to start new timer fft
> > 219 [Timers_stop]: Trying to stop new timer fft | 219 [Timers_stop]: Trying to stop new timer fft
> > 220 [Timers_start]: Trying to start new timer fft | 220 [Timers_start]: Trying to start new timer fft
> > 221 [Timers_stop]: Trying to stop new timer fft | 221 [Timers_stop]: Trying to stop new timer fft
> > 222 [Timers_start]: Trying to start new timer fft | 222 [Timers_start]: Trying to start new timer fft
> > 223 [Timers_stop]: Trying to stop new timer fft | 223 [Timers_stop]: Trying to stop new timer fft
> > ------------------------------------------------------------ | 224 [Timers_start]: Trying to start new timer fft
> > ------------------------------------------------------------ | 225 [Timers_stop]: Trying to stop new timer fft
> > ------------------------------------------------------------ | 226 [Timers_start]: Trying to start new timer fft
> > ------------------------------------------------------------ | 227 [Timers_stop]: Trying to stop new timer fft
> > 224 [Timers_start]: Trying to start new timer gr_hgBndry | 228 [Timers_start]: Trying to start new timer gr_hgBndry
> > 225 [Timers_start]: Trying to start new timer work copy | 229 [Timers_start]: Trying to start new timer work copy
> > 226 [Timers_stop]: Trying to stop new timer work copy | 230 [Timers_stop]: Trying to stop new timer work copy
> > 227 [Timers_start]: Trying to start new timer gr_hgGuardCell | 231 [Timers_start]: Trying to start new timer gr_hgGuardCell
> > 228 [Timers_stop]: Trying to stop new timer gr_hgGuardCell | 232 [Timers_stop]: Trying to stop new timer gr_hgGuardCell
> > 229 [Timers_start]: Trying to start new timer work copy | 233 [Timers_start]: Trying to start new timer work copy
> > + 230 +--309 lines: [Timers_stop]: Trying to stop new timer work copy--- | + 234 +--309 lines: [Timers_stop]: Trying to stop new timer work copy---
> > 539 [Timers_start]: Trying to start new timer fft | 543 [Timers_start]: Trying to start new timer fft
> > 540 [Timers_stop]: Trying to stop new timer fft | 544 [Timers_stop]: Trying to stop new timer fft
> > 541 [Timers_start]: Trying to start new timer fft | 545 [Timers_start]: Trying to start new timer fft
> > 542 [Timers_stop]: Trying to stop new timer fft | 546 [Timers_stop]: Trying to stop new timer fft
> > 543 [Timers_start]: Trying to start new timer fft | 547 [Timers_start]: Trying to start new timer fft
> > 544 [Timers_stop]: Trying to stop new timer fft | 548 [Timers_stop]: Trying to stop new timer fft
> > ------------------------------------------------------------ | 549 [Timers_start]: Trying to start new timer fft
> > ------------------------------------------------------------ | 550 [Timers_stop]: Trying to stop new timer fft
> > ------------------------------------------------------------ | 551 [Timers_start]: Trying to start new timer fft
> > ------------------------------------------------------------ | 552 [Timers_stop]: Trying to stop new timer fft
> > 545 [Timers_start]: Trying to start new timer gr_hgBndry | 553 [Timers_start]: Trying to start new timer gr_hgBndry
> > 546 [Timers_start]: Trying to start new timer work copy | 554 [Timers_start]: Trying to start new timer work copy
> > 547 [Timers_stop]: Trying to stop new timer work copy | 555 [Timers_stop]: Trying to stop new timer work copy
> > 548 [Timers_start]: Trying to start new timer gr_hgGuardCell | 556 [Timers_start]: Trying to start new timer gr_hgGuardCell
> > 549 [Timers_stop]: Trying to stop new timer gr_hgGuardCell | 557 [Timers_stop]: Trying to stop new timer gr_hgGuardCell
> > 550 [Timers_start]: Trying to start new timer work copy | 558 [Timers_start]: Trying to start new timer work copy
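For anyone who would rather not read the vimdiff by eye, here is a minimal sketch of the same comparison in Python. It assumes the debug files contain lines in the format shown in the diff above ("[Timers_start]: Trying to start new timer <name>"). The function names `parse_events` and `extra_events` are made up for this example and are not part of FLASH or of the attached debugging code.

```python
import re
from difflib import SequenceMatcher

# Matches debug lines such as
#   "[Timers_start]: Trying to start new timer fft"
#   "[Timers_stop]: Trying to stop new timer work copy"
EVENT = re.compile(
    r"\[Timers_(start|stop)\]: Trying to (?:start|stop) new timer (.+)"
)


def parse_events(lines):
    """Return a list of (action, timer_name) tuples, skipping other lines."""
    events = []
    for line in lines:
        match = EVENT.search(line)
        if match:
            events.append((match.group(1), match.group(2).strip()))
    return events


def extra_events(reference, other):
    """List the events that appear in only one of the two sequences.

    Each item is (which, position, event), where position is the 0-based
    index of the event within that rank's own event list.
    """
    extras = []
    matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            extras.extend(("reference", i, reference[i]) for i in range(i1, i2))
        if tag in ("insert", "replace"):
            extras.extend(("other", j, other[j]) for j in range(j1, j2))
    return extras


# Possible usage with the two files compared above:
#   with open("timer_debug0000.txt") as f0, open("timer_debug0119.txt") as f1:
#       for item in extra_events(parse_events(f0), parse_events(f1)):
#           print(item)
```

Comparing on (action, name) pairs rather than on the raw lines keeps the shifted timer indices (224 versus 228, and so on) from marking every later line as different.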
> >
> > And grep'ing for call Timers_start("fft") I see:
> >
> > josh@iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn 'call Timers_start("fft")' *
> > gr_hgSolveLevel.F90:176: call Timers_start("fft")
> > gr_hgSolveLevel.F90:194: call Timers_start("fft") !trick to keep timers structure on different procs the same - KW
> >
> > So I'm hoping KW remembers the trick? :)
> >
> > Cordially,
> >
> > Josh
> >
> > On Wed, Sep 12, 2018 at 11:35 AM Klaus Weide <klaus at flash.uchicago.edu> wrote:
> >
> > On Tue, 11 Sep 2018, Joshua Wall wrote:
> >
> > > Ryan,
> > >
> > > Another clue I just found, which probably supports Klaus's theory that
> > > some timer was started but never stopped, is the following message in
> > > my log file:
> > >
> > > [ 09-10-2018 23:31:07.824 ] [Timers_getSummary]: Not writing timer
> > > max/min/avg values because not all processors had same timers
> >
> > This can happen if not all procs execute the same code (in particular:
> > the same Timers calls), for example if you are running on more procs
> > than you have blocks, especially in the initial step of a simulation.
> > By itself this should be harmless and not cause the other problems
> > reported.
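To illustrate the situation Klaus describes, here is a small toy model (not FLASH code) in which each rank only reaches a timed section if it owns at least one block. The round-robin block assignment and the timer names "evolution" and "block work" are assumptions made up for this example.

```python
def timers_per_rank(num_ranks, num_blocks):
    """Return, for each rank, the set of timer names it would create."""
    timers = []
    for rank in range(num_ranks):
        # Round-robin block ownership, purely for illustration.
        blocks = range(rank, num_blocks, num_ranks)
        names = {"evolution"}        # every rank starts this timer
        if len(blocks) > 0:
            names.add("block work")  # only reached by ranks that own a block
        timers.append(names)
    return timers


def all_same(timers):
    """True if every rank ended up with the same set of timers."""
    return all(names == timers[0] for names in timers)
```

With 4 ranks and 8 blocks every rank has both timers. With 4 ranks and only 2 blocks, two ranks never create "block work", which is the kind of mismatch the log message reports.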
> >
> > > In case you find the same. To be clear on what I'm running with exactly:
> > >
> > > Hydro: USM
> > > Grav: MG
> > > +cube16
> > > AMR
> > > +pm4dev
> > > +supportPPMupwind
> > > maxblocks = 50
> > > -a -3d
> > > optimization: -O3
> > > MPI: OpenMPI 1.10.02
> > > compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)
> > >
> > >
> > > Klaus,
> > >
> > > Would it be possible to call Timers_getSummary() after each call to the
> > > individual units in Driver_evolveFlash in an attempt to find the unit
> > > responsible by "bisection"? Essentially the same log message as above
> > > should print right after some unit gets the timers on the different
> > > processors out of sync. Does that seem okay to try?
> >
> >
> > I don't know whether Timers_getSummary behaves well if it is called more
> > than once. Normally, as you know, it is called exactly once, at the end of
> > a run. You are free to experiment, of course!
> >
> >
> > The following may also be useful:
> >
> > You should be able to completely disable the Timers code by setting up
> > using the following in your setup command:
> >
> > --without-unit=monitors/Timers
> >
> > That should allow you to test the remainder of your code without
> > interference from improperly nested Timers calls.
> >
> > Klaus
> >
> > --
> > Joshua Wall
> > Doctoral Candidate
> > Department of Physics
> > Drexel University
> > 3141 Chestnut Street
> > Philadelphia, PA 19104
>
> --
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104