[FLASH-USERS] Timer runs out of space
Tomasz Plewa
tplewa at fsu.edu
Wed Sep 12 13:51:52 EDT 2018
Hi Josh -
We have that implemented in our local version of FLASH, but it can be
easily added to the mainstream version. The logical flag is a part of
the timers module, so it is globally accessible to all processes. Then
one only needs to decide which code sections to exclude from timing, and
if such exclusion is actually necessary.
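
For anyone wanting to try this in their own copy, here is a minimal, self-contained sketch of the idea. It is not our local code and not the FLASH Timers API: the module name toy_timers, the flag timing_enabled, and all routine names are hypothetical, chosen only for illustration. Note also that each MPI rank holds its own copy of a module variable, so every rank has to toggle the flag at the same points in the code.

```fortran
! Hypothetical sketch only: none of these names belong to FLASH's Timers unit.
module toy_timers
  implicit none
  private
  public :: toy_timers_set_enabled, toy_timer_start, toy_timer_stop, toy_timer_count

  integer, parameter :: MAX_TIMERS = 16
  logical, save :: timing_enabled = .true.      ! the on/off flag
  integer, save :: n_timers = 0                 ! number of registered timers
  character(len=32), save :: names(MAX_TIMERS)
  integer, save :: n_calls(MAX_TIMERS) = 0

contains

  subroutine toy_timers_set_enabled(flag)
    logical, intent(in) :: flag
    timing_enabled = flag
  end subroutine toy_timers_set_enabled

  subroutine toy_timer_start(name)
    character(len=*), intent(in) :: name
    integer :: i
    if (.not. timing_enabled) return            ! excluded section: record nothing
    do i = 1, n_timers
      if (names(i) == name) then                ! known timer: just count the call
        n_calls(i) = n_calls(i) + 1
        return
      end if
    end do
    if (n_timers >= MAX_TIMERS) error stop "toy_timers: out of timer slots"
    n_timers = n_timers + 1                     ! first use: register a new timer
    names(n_timers) = name
    n_calls(n_timers) = 1
  end subroutine toy_timer_start

  subroutine toy_timer_stop(name)
    character(len=*), intent(in) :: name
    if (.not. timing_enabled) return
    ! A real implementation would accumulate the elapsed time for "name" here.
  end subroutine toy_timer_stop

  integer function toy_timer_count()
    toy_timer_count = n_timers
  end function toy_timer_count

end module toy_timers

program demo
  use toy_timers
  implicit none
  call toy_timer_start("hydro"); call toy_timer_stop("hydro")
  call toy_timers_set_enabled(.false.)          ! entering a problematic section
  call toy_timer_start("fft");   call toy_timer_stop("fft")
  call toy_timers_set_enabled(.true.)           ! leaving it
  print *, "timers registered:", toy_timer_count()
end program demo
```

In this toy version only "hydro" is registered, because the "fft" calls are made while the flag is off. In a real code the set_enabled calls would bracket the section in which only some processes participate.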
Hope this helps -
Tomek
--
On 09/12/18 13:42, Joshua Wall wrote:
> Hello Tomek,
>
> Ah I see. Is this flag available in FLASH 4.2.2, and if so how would I
> set it?
>
> Also going to now push this back to the flash-users list so that
> others can search for this in the future.
>
> Josh
>
> On Wed, Sep 12, 2018 at 1:28 PM Tomasz Plewa <tplewa at fsu.edu> wrote:
>
> Josh -
>
> Your problem sounded familiar. The issue is that a multigrid-based
> solution is obtained in a composite way with coarsening of mesh
> structure. This in turn implies participation of progressively fewer
> processes as the V-cycle sweeps from fine to coarse. So eventually a
> pool of processes not performing relaxations emerges and grows (it may
> also be non-empty from the very beginning).
>
> We "solved" this issue by passing a flag to timers that disables/enables
> timing of such problematic code sections.
>
> Tomek
> --
> On 09/12/18 13:08, Joshua Wall wrote:
> > Hello Klaus and Ryan,
> >
> > Thanks to a very helpful suggestion from Tomek, I modified my bit of
> > debugging code. Now each processor writes out a statement to its own
> > file whenever a new timer is started (the timer index increases).
> > These files can then be diff'ed to see if any processor is creating
> > a new timer that the others don't have. I'll attach my new versions
> > of the files to this email. (I also added notes for the debugging
> > text just in case you'd like to merge this into FLASH at any point.)
> >
> > Using this, I've tracked my problem down to the fft solver timer
> > (which I promise I've made NO edits to!). Here's my vimdiff of the
> > root timer_debug0000.txt and timer_debug0119.txt:
> >
> > 218 [Timers_start]: Trying to start new timer fft                     | 218 [Timers_start]: Trying to start new timer fft
> > 219 [Timers_stop]: Trying to stop new timer fft                       | 219 [Timers_stop]: Trying to stop new timer fft
> > 220 [Timers_start]: Trying to start new timer fft                     | 220 [Timers_start]: Trying to start new timer fft
> > 221 [Timers_stop]: Trying to stop new timer fft                       | 221 [Timers_stop]: Trying to stop new timer fft
> > 222 [Timers_start]: Trying to start new timer fft                     | 222 [Timers_start]: Trying to start new timer fft
> > 223 [Timers_stop]: Trying to stop new timer fft                       | 223 [Timers_stop]: Trying to stop new timer fft
> > ----------------------------------------------------------------------| 224 [Timers_start]: Trying to start new timer fft
> > ----------------------------------------------------------------------| 225 [Timers_stop]: Trying to stop new timer fft
> > ----------------------------------------------------------------------| 226 [Timers_start]: Trying to start new timer fft
> > ----------------------------------------------------------------------| 227 [Timers_stop]: Trying to stop new timer fft
> > 224 [Timers_start]: Trying to start new timer gr_hgBndry              | 228 [Timers_start]: Trying to start new timer gr_hgBndry
> > 225 [Timers_start]: Trying to start new timer work copy               | 229 [Timers_start]: Trying to start new timer work copy
> > 226 [Timers_stop]: Trying to stop new timer work copy                 | 230 [Timers_stop]: Trying to stop new timer work copy
> > 227 [Timers_start]: Trying to start new timer gr_hgGuardCell          | 231 [Timers_start]: Trying to start new timer gr_hgGuardCell
> > 228 [Timers_stop]: Trying to stop new timer gr_hgGuardCell            | 232 [Timers_stop]: Trying to stop new timer gr_hgGuardCell
> > 229 [Timers_start]: Trying to start new timer work copy               | 233 [Timers_start]: Trying to start new timer work copy
> > + 230 +--309 lines: [Timers_stop]: Trying to stop new timer work copy-| + 234 +--309 lines: [Timers_stop]: Trying to stop new timer work copy-
> > 539 [Timers_start]: Trying to start new timer fft                     | 543 [Timers_start]: Trying to start new timer fft
> > 540 [Timers_stop]: Trying to stop new timer fft                       | 544 [Timers_stop]: Trying to stop new timer fft
> > 541 [Timers_start]: Trying to start new timer fft                     | 545 [Timers_start]: Trying to start new timer fft
> > 542 [Timers_stop]: Trying to stop new timer fft                       | 546 [Timers_stop]: Trying to stop new timer fft
> > 543 [Timers_start]: Trying to start new timer fft                     | 547 [Timers_start]: Trying to start new timer fft
> > 544 [Timers_stop]: Trying to stop new timer fft                       | 548 [Timers_stop]: Trying to stop new timer fft
> > ----------------------------------------------------------------------| 549 [Timers_start]: Trying to start new timer fft
> > ----------------------------------------------------------------------| 550 [Timers_stop]: Trying to stop new timer fft
> > ----------------------------------------------------------------------| 551 [Timers_start]: Trying to start new timer fft
> > ----------------------------------------------------------------------| 552 [Timers_stop]: Trying to stop new timer fft
> > 545 [Timers_start]: Trying to start new timer gr_hgBndry              | 553 [Timers_start]: Trying to start new timer gr_hgBndry
> > 546 [Timers_start]: Trying to start new timer work copy               | 554 [Timers_start]: Trying to start new timer work copy
> > 547 [Timers_stop]: Trying to stop new timer work copy                 | 555 [Timers_stop]: Trying to stop new timer work copy
> > 548 [Timers_start]: Trying to start new timer gr_hgGuardCell          | 556 [Timers_start]: Trying to start new timer gr_hgGuardCell
> > 549 [Timers_stop]: Trying to stop new timer gr_hgGuardCell            | 557 [Timers_stop]: Trying to stop new timer gr_hgGuardCell
> > 550 [Timers_start]: Trying to start new timer work copy               | 558 [Timers_start]: Trying to start new timer work copy
> >
> > And grep'ing for call Timers_start("fft") I see:
> >
> > josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn 'call Timers_start("fft")' *
> > gr_hgSolveLevel.F90:176: call Timers_start("fft")
> > gr_hgSolveLevel.F90:194: call Timers_start("fft") !trick to keep timers structure on different procs the same - KW
> >
> > So I'm hoping KW remembers the trick? :)
> >
> > Cordially,
> >
> > Josh
> >
> > On Wed, Sep 12, 2018 at 11:35 AM Klaus Weide <klaus at flash.uchicago.edu> wrote:
> >
> > On Tue, 11 Sep 2018, Joshua Wall wrote:
> >
> > > Ryan,
> > >
> > > Another clue I just found, which probably supports Klaus's theory
> > > that some timer was started but never stopped, is the following
> > > message in my log file:
> > >
> > > [ 09-10-2018 23:31:07.824 ] [Timers_getSummary]: Not writing timer max/min/avg values because not all processors had same timers
> >
> > This can happen if not all procs execute the same code (in particular,
> > the same Timers calls), for example if you are running on more procs
> > than you have blocks, especially in the initial step of a simulation.
> > By itself this should be harmless and not cause the other problems
> > reported.
> >
> > > In case you find the same. To be clear on what I'm running with exactly:
> > >
> > > Hydro: USM
> > > Grav: MG
> > > +cube16
> > > AMR
> > > +pm4dev
> > > +supportPPMupwind
> > > maxblocks = 50
> > > -a -3d
> > > optimization: -O3
> > > MPI: OpenMPI 1.10.02
> > > compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)
> > >
> > >
> > > Klaus,
> > >
> > > Would it be possible to call Timers_getSummary() after each call to
> > > the individual units in Driver_evolveFlash in an attempt to find the
> > > unit responsible by "bisection"? Essentially the same log message as
> > > above should print right after some unit gets the timers on the
> > > different processors out of sync. Does that seem okay to try?
> >
> >
> > I don't know whether Timers_getSummary behaves well if it is called
> > more than once. Normally, as you know, it is called exactly once, at
> > the end of a run. You are free to experiment, of course!
> >
> >
> > The following may also be useful:
> >
> > You should be able to completely disable the Timers code by setting
> > up with the following in your setup command:
> >
> > --without-unit=monitors/Timers
> >
> > That should allow you to test the remainder of your code without
> > interference from improperly nested Timers calls.
> >
> > Klaus
> >
> > --
> > Joshua Wall
> > Doctoral Candidate
> > Department of Physics
> > Drexel University
> > 3141 Chestnut Street
> > Philadelphia, PA 19104
>