[FLASH-USERS] Timer runs out of space

Joshua Wall joshua.e.wall at gmail.com
Wed Sep 12 13:42:37 EDT 2018


Hello Tomek,

Ah I see. Is this flag available in FLASH 4.2.2, and if so how would I set
it?

I'm also pushing this back to the flash-users list so that others can
search for it in the future.

Josh

On Wed, Sep 12, 2018 at 1:28 PM Tomasz Plewa <tplewa at fsu.edu> wrote:

> Josh -
>
> Your problem sounded familiar. The issue is that a multigrid-based
> solution is obtained in a composite way with coarsening of mesh
> structure. This in turn implies participation of progressively fewer
> processes as the V-cycle sweeps from fine to coarse. So eventually a
> pool of processes not performing relaxations emerges and grows (it may
> also be non-empty from the very beginning).
>
> We "solved" this issue by passing a flag to timers that disables/enables
> timing of such problematic code sections.
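
A toy illustration of the effect described above, written in Python rather than FLASH's Fortran. Everything here (`TimerRegistry`, `v_cycle`, the `guard_timers` flag, the block counts) is hypothetical and is not the FLASH Timers API or the actual patch; it only sketches why ranks that skip coarse levels end up with a different sequence of timer calls, and how a flag that switches timing off around such a section keeps the sequences identical.

```python
class TimerRegistry:
    """Toy stand-in for one process's timer bookkeeping (not FLASH's Timers unit)."""

    def __init__(self):
        self.enabled = True
        self.events = []  # ordered list of ("start" | "stop", timer name)

    def start(self, name):
        if self.enabled:
            self.events.append(("start", name))

    def stop(self, name):
        if self.enabled:
            self.events.append(("stop", name))


def v_cycle(timers, blocks_per_level, guard_timers):
    """Sweep levels fine to coarse; a rank only works on levels where it owns blocks."""
    if guard_timers:
        timers.enabled = False  # hypothetical flag: do not time this section at all
    for nblocks in blocks_per_level:
        if nblocks == 0:
            continue  # this rank sits idle on this level
        timers.start("fft")
        timers.stop("fft")
    timers.enabled = True


def run(blocks_per_level, guard_timers):
    """Return the timer event sequence one rank would record."""
    timers = TimerRegistry()
    timers.start("gravity")
    v_cycle(timers, blocks_per_level, guard_timers)
    timers.stop("gravity")
    return timers.events


# Assumed block counts per level (fine to coarse) for two ranks.
rank_a = [4, 2, 1]
rank_b = [4, 0, 0]

print(run(rank_a, guard_timers=False) == run(rank_b, guard_timers=False))
print(run(rank_a, guard_timers=True) == run(rank_b, guard_timers=True))
```

Without the guard the two ranks record different numbers of "fft" events; with it, both record only the enclosing "gravity" timer.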
>
> Tomek
> --
> On 09/12/18 13:08, Joshua Wall wrote:
> > Hello Klaus and Ryan,
> >
> >     Thanks to a very helpful suggestion from Tomek, I modified my bit
> > of debugging code. Now each processor will write out a statement
> > whenever a new timer is started (the timer index increases) to a
> > separate file for each processor. These files can then be diff'ed to
> > see if any processor is creating a new timer that the others don't
> > have. I'll attach my new versions of the files on this email. (I also
> > added notes for the debugging text just in case you'd like to merge
> > this into FLASH at any point.)
> >
> > Using this, I've tracked my problem down to the fft solver timer
> > (which I promise I've made NO edits to!). Here's my vimdiff of the root
> > timer_debug0000.txt and timer_debug0119.txt:
> >
> >     218 [Timers_start]: Trying to start new timer fft            |    218 [Timers_start]: Trying to start new timer fft
> >     219 [Timers_stop]: Trying to stop new timer fft              |    219 [Timers_stop]: Trying to stop new timer fft
> >     220 [Timers_start]: Trying to start new timer fft            |    220 [Timers_start]: Trying to start new timer fft
> >     221 [Timers_stop]: Trying to stop new timer fft              |    221 [Timers_stop]: Trying to stop new timer fft
> >     222 [Timers_start]: Trying to start new timer fft            |    222 [Timers_start]: Trying to start new timer fft
> >     223 [Timers_stop]: Trying to stop new timer fft              |    223 [Timers_stop]: Trying to stop new timer fft
> >     ------------------------------------------------------------ |    224 [Timers_start]: Trying to start new timer fft
> >     ------------------------------------------------------------ |    225 [Timers_stop]: Trying to stop new timer fft
> >     ------------------------------------------------------------ |    226 [Timers_start]: Trying to start new timer fft
> >     ------------------------------------------------------------ |    227 [Timers_stop]: Trying to stop new timer fft
> >     224 [Timers_start]: Trying to start new timer gr_hgBndry     |    228 [Timers_start]: Trying to start new timer gr_hgBndry
> >     225 [Timers_start]: Trying to start new timer work copy      |    229 [Timers_start]: Trying to start new timer work copy
> >     226 [Timers_stop]: Trying to stop new timer work copy        |    230 [Timers_stop]: Trying to stop new timer work copy
> >     227 [Timers_start]: Trying to start new timer gr_hgGuardCell |    231 [Timers_start]: Trying to start new timer gr_hgGuardCell
> >     228 [Timers_stop]: Trying to stop new timer gr_hgGuardCell   |    232 [Timers_stop]: Trying to stop new timer gr_hgGuardCell
> >     229 [Timers_start]: Trying to start new timer work copy      |    233 [Timers_start]: Trying to start new timer work copy
> > +   230 +--309 lines: [Timers_stop]: Trying to stop new timer work copy |+   234 +--309 lines: [Timers_stop]: Trying to stop new timer work copy
> >     539 [Timers_start]: Trying to start new timer fft            |    543 [Timers_start]: Trying to start new timer fft
> >     540 [Timers_stop]: Trying to stop new timer fft              |    544 [Timers_stop]: Trying to stop new timer fft
> >     541 [Timers_start]: Trying to start new timer fft            |    545 [Timers_start]: Trying to start new timer fft
> >     542 [Timers_stop]: Trying to stop new timer fft              |    546 [Timers_stop]: Trying to stop new timer fft
> >     543 [Timers_start]: Trying to start new timer fft            |    547 [Timers_start]: Trying to start new timer fft
> >     544 [Timers_stop]: Trying to stop new timer fft              |    548 [Timers_stop]: Trying to stop new timer fft
> >     ------------------------------------------------------------ |    549 [Timers_start]: Trying to start new timer fft
> >     ------------------------------------------------------------ |    550 [Timers_stop]: Trying to stop new timer fft
> >     ------------------------------------------------------------ |    551 [Timers_start]: Trying to start new timer fft
> >     ------------------------------------------------------------ |    552 [Timers_stop]: Trying to stop new timer fft
> >     545 [Timers_start]: Trying to start new timer gr_hgBndry     |    553 [Timers_start]: Trying to start new timer gr_hgBndry
> >     546 [Timers_start]: Trying to start new timer work copy      |    554 [Timers_start]: Trying to start new timer work copy
> >     547 [Timers_stop]: Trying to stop new timer work copy        |    555 [Timers_stop]: Trying to stop new timer work copy
> >     548 [Timers_start]: Trying to start new timer gr_hgGuardCell |    556 [Timers_start]: Trying to start new timer gr_hgGuardCell
> >     549 [Timers_stop]: Trying to stop new timer gr_hgGuardCell   |    557 [Timers_stop]: Trying to stop new timer gr_hgGuardCell
> >     550 [Timers_start]: Trying to start new timer work copy      |    558 [Timers_start]: Trying to start new timer work copy
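
The side-by-side comparison above could also be automated. The following is a minimal Python sketch, assuming the per-processor debug files contain lines in the "[Timers_start]: Trying to start new timer NAME" form quoted in this thread; the helper names `parse_events` and `first_divergence` are made up for illustration and are not part of FLASH.

```python
import re
from itertools import zip_longest

# Matches the debug lines quoted in this thread, e.g.
# "[Timers_start]: Trying to start new timer fft"
EVENT = re.compile(
    r"\[Timers_(start|stop)\]: Trying to (?:start|stop) new timer (.+)"
)


def parse_events(lines):
    """Turn debug-file lines into a list of (action, timer name) tuples."""
    events = []
    for line in lines:
        match = EVENT.search(line)
        if match:
            events.append((match.group(1), match.group(2).strip()))
    return events


def first_divergence(events_a, events_b):
    """Return (index, event_a, event_b) at the first mismatch, or None if identical."""
    for i, (a, b) in enumerate(zip_longest(events_a, events_b)):
        if a != b:
            return i, a, b
    return None


# Small in-memory sample shaped like the excerpt above:
# one rank makes three fft start/stop pairs, the other makes five.
pair = [
    "[Timers_start]: Trying to start new timer fft",
    "[Timers_stop]: Trying to stop new timer fft",
]
tail = ["[Timers_start]: Trying to start new timer gr_hgBndry"]
root_lines = pair * 3 + tail
other_lines = pair * 5 + tail

print(first_divergence(parse_events(root_lines), parse_events(other_lines)))
```

For real files, each list of lines could come from something like `open("timer_debug0000.txt")`, with the root file compared against every other processor's file in turn.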
> >
> > And grep'ing for call Timers_start("fft") I see:
> >
> > josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn 'call Timers_start("fft")' *
> > gr_hgSolveLevel.F90:176:        call Timers_start("fft")
> > gr_hgSolveLevel.F90:194:     call Timers_start("fft") !trick to keep timers structure on different procs the same - KW
> >
> > So I'm hoping KW remembers the trick? :)
> >
> > Cordially,
> >
> > Josh
> >
> > On Wed, Sep 12, 2018 at 11:35 AM Klaus Weide <klaus at flash.uchicago.edu> wrote:
> >
> >     On Tue, 11 Sep 2018, Joshua Wall wrote:
> >
> >     > Ryan,
> >     >
> >     >    Another clue I just found, which probably supports Klaus's theory
> >     > that some timer was started but never stopped, is the following from
> >     > my log file:
> >     >
> >     > [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing timer
> >     > max/min/avg values because not all processors had same timers
> >
> >     This can happen if not all procs execute the same code (in particular,
> >     the same Timers calls), for example if you are running on more procs
> >     than you have blocks, especially in the initial step of a simulation.
> >     By itself this should be harmless and should not cause the other
> >     problems reported.
> >
> >     > In case you find the same, here is exactly what I'm running with:
> >     >
> >     > Hydro: USM
> >     > Grav: MG
> >     > +cube16
> >     > AMR
> >     > +pm4dev
> >     > +supportPPMupwind
> >     > maxblocks = 50
> >     > -a -3d
> >     > optimization: -O3
> >     > MPI: OpenMPI 1.10.02
> >     > compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)
> >     >
> >     >
> >     > Klaus,
> >     >
> >     >    Would it be possible to call Timers_getSummary() after each call to the
> >     > individual units in Driver_evolveFlash in an attempt to find the unit
> >     > responsible by "bisection"? Essentially the same log message as above
> >     > should print right after some unit gets the timers on the different
> >     > processors out of sync. Does that seem okay to try?
> >
> >
> >     I don't know whether Timers_getSummary behaves well if it is called more
> >     than once. Normally, as you know, it is called exactly once, at the end of
> >     a run. You are free to experiment, of course!
> >
> >
> >     The following may also be useful:
> >
> >     You should be able to completely disable the Timers code by adding
> >     the following to your setup command:
> >
> >        --without-unit=monitors/Timers
> >
> >     That should allow you to test the remainder of your code without
> >     interference from improperly nested Timers calls.
> >
> >     Klaus
> >
> > --
> > Joshua Wall
> > Doctoral Candidate
> > Department of Physics
> > Drexel University
> > 3141 Chestnut Street
> > Philadelphia, PA 19104
>