[FLASH-USERS] Timer runs out of space

Tomasz Plewa tplewa at fsu.edu
Thu Sep 13 07:00:54 EDT 2018


Hi Ryan -

Certain sections of the code that contain timer start/stop pairs may be 
executed differently by different processes. For example, a do-loop 
count may depend on the number of active blocks on a process, and that 
number can be zero. The no-blocks case is bad because the timers inside 
the loop will never be called on that process.

You are right that grepping the source code for timer start/stop calls 
is not sufficient. It is only a cheap first attempt to identify a 
possible problem, and it will not help in the case described above.
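
As a rough illustration of what tracking timer calls per process could look like, here is a minimal Python sketch. It is not part of FLASH. It assumes each process has recorded a trace of (action, timer name) events, for example from debug prints added to Timers_start and Timers_stop. The trace format, the helper names check_trace, compare_ranks and block_loop_trace, and the timer names "evolution" and "per-block work" are all hypothetical.

```python
def check_trace(events):
    """Return a list of pairing/nesting problems in one process's trace.

    events: iterable of (action, name) tuples; action is "start" or "stop".
    (Hypothetical trace format, not something FLASH writes by itself.)
    """
    stack = []       # names of running timers, innermost last
    problems = []
    for action, name in events:
        if action == "start":
            stack.append(name)
        elif action == "stop":
            if stack and stack[-1] == name:
                stack.pop()
            else:
                problems.append(
                    f"stop of '{name}' does not match innermost running timer")
        else:
            raise ValueError(f"unknown action: {action}")
    for name in stack:
        problems.append(f"'{name}' started but never stopped")
    return problems


def compare_ranks(traces):
    """Map each rank to the timer names other ranks used but it did not."""
    used = {rank: {name for _, name in events}
            for rank, events in traces.items()}
    all_names = set().union(*used.values()) if used else set()
    return {rank: all_names - names
            for rank, names in used.items() if all_names - names}


def block_loop_trace(n_blocks):
    """Toy trace: a timer pair inside a loop over the local blocks."""
    events = [("start", "evolution")]
    for _ in range(n_blocks):
        events += [("start", "per-block work"), ("stop", "per-block work")]
    events.append(("stop", "evolution"))
    return events


# Rank 0 owns three blocks, rank 1 owns none.
traces = {0: block_loop_trace(3), 1: block_loop_trace(0)}
for rank, events in traces.items():
    print(rank, check_trace(events))
print(compare_ranks(traces))
```

In this toy case both traces are balanced, so check_trace reports nothing for either rank, yet compare_ranks shows that rank 1 never used "per-block work". A source-level search for start/stop pairs cannot reveal that kind of difference between processes.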

Tomek
--
On 09/12/18 16:30, Ryan Farber wrote:
> Tomek, I don't quite understand what you wrote. Won't all processes 
> execute the same code unless it is enclosed in some conditional that 
> excludes them (e.g., "if (my_rank == MASTER_PE) then .... endif")? Any 
> such conditional would be noticeable by looking at the source code 
> between a Timers_start and a Timers_stop.
>
> Another thing to note is that grepping just for "call Timers" is not 
> quite general enough because there are also 
> "#IO_TIMERS_START/STOP(blah)" pairs of statements as well.
>
> However, I didn't notice anything wrong with those pairs in my case 
> either. I'm content with just setting tmr_maxTimerParents to a suitably 
> large number as a quick fix.
>
> Best,
> --------
> Ryan
>
> On Wed, Sep 12, 2018 at 9:45 AM, Tomasz Plewa <tplewa at fsu.edu> wrote:
>
>     Apart from simple missing start/stop pairing issues, one or more
>     timer start/stop calls might be embedded in a code section
>     that is not executed by all participating processes. The latter
>     cannot be identified by analyzing the source code alone and needs
>     tracking of timer calls from individual processes.
>
>     Tomek
>     --
>     On 09/12/18 09:27, Joshua Wall wrote:
>
>         Ryan,
>
>             Thanks for the info. Indeed my issue wasn't that runs were
>         stopping, but my output was being flooded with the timer error
>         messages about running out of space.
>
>             I'm also fairly sure I'm not overproducing timers or
>         replicating any specific timer more than once... I added a
>         couple of debugging statements in Timers_start and Timers_stop
>         that allow me to look for this. I'll attach a copy for you in
>         case you want to check for the same. Just grep your output
>         file like this:
>
>         grep -iIn "timer" output.txt > timer_grep.txt
>
>             and you should see all the timers it makes, including if
>         something with the same name gets made more than once (the
>         name will repeat for different indices if I'm reading how the
>         timers are tracked correctly). In my case I can see that the
>         root process has 52 timers after the first loop completes and
>         54 after the first output files are written, then it never
>         grows after this. I'm honestly not sure why I haven't seen
>         this message before now if the limit on timers has always been
>         30... it only appeared as I moved to running on > 256
>         processors. Actually, this test where I see 54 timers on the
>         root process still has tmr_maxTimerParents=30, and it
>         shows no error messages from the Timers module (it's only
>         running on 120 processors).
>
>         Josh
>
>
>         On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <rjfarber at umich.edu> wrote:
>
>             Hi Josh,
>
>             I solved the "perfmon" error by copying Timers_data.F90 to my
>             problem directory and setting tmr_maxTimerParents = 1000.
>
>             Explanation: I noticed that the number of distinct strings
>             passed in calls to Timers_start/stop seemed to be much greater
>             than 30 (for me it is of order 100). The "perfmon" message is
>             in Timers_start.F90 for the case that j < 0, where j gets its
>             value from tmr_stackListAdd(). tmr_stackListAdd is a
>             subroutine in tmr_stackLib.F90 which passes -1 if the
>             referenced string == tmr_maxTimerParents.
>
>             However, the "perfmon" message doesn't actually throw an
>             exception. I forgot that I've had runs (probably the static
>             mesh refinement ones too, but I can't check with Stampede2
>             down for maintenance today) that run okay even with this
>             perfmon error.
>
>             My actual error message (which I'll still have to debug) is:
>             DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
>             blockID = 0
>
>             I'm guessing your error is different.
>
>             Still, that solves one mystery.
>
>             Best,
>             --------
>             Ryan
>
>             On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall
>             <joshua.e.wall at gmail.com> wrote:
>
>                 Ryan,
>
>                    Another clue I just found that probably supports Klaus's
>                 theory of some timer started that didn't get stopped is the
>                 following I found in my log file:
>
>                 [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing
>                 timer max/min/avg values because not all processors had same
>                 timers
>
>                 In case you find the same. To be clear on what I'm running
>                 with exactly:
>
>                 Hydro: USM
>                 Grav: MG
>                 +cube16
>                 AMR
>                 +pm4dev
>                 +supportPPMupwind
>                 maxblocks = 50
>                 -a -3d
>                 optimization: -O3
>                 MPI: OpenMPI 1.10.02
>                 compiler: gnu 4.8.0 (with the -O0 fix for
>                 mpi_amr_1blk_guardcell.o)
>
>
>                 Klaus,
>
>                    Would it be possible to call Timers_getSummary() after each
>                 call to the individual units in Driver_evolveFlash in an
>                 attempt to find the unit responsible by "bisection"?
>                 Essentially the same log message as above should print right
>                 after some unit gets the timers on the different processors
>                 out of sync. Does that seem okay to try?
>
>                 Cordially,
>
>                 Josh
>
>                 On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall
>                 <joshua.e.wall at gmail.com> wrote:
>
>                     Hello Ryan and Klaus,
>
>                         Ryan, yes I am running in AMR. Per Klaus's
>                     recommendation I just did:
>
>                     josh@iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" * > timers_grep.txt
>
>                     This will give me a list of calls to Timers_start and
>                     Timers_stop in one text file. After I check this file to
>                     ensure that all have pairs, I'll have to check each FLASH
>                     file for "if () then" and "#ifdef" statements that might
>                     have isolated one of the timer calls. Feel free to also do
>                     this as a check in case I miss anything. Whoever finds it
>                     first can post what we find here (or if we don't find
>                     anything).
>
>                     Cordially,
>
>                     Josh
>
>                     On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber
>                     <rjfarber at umich.edu> wrote:
>
>                         Hi Josh/Klaus,
>
>                         I'm also suffering this problem. It only appears for
>                         me when I run my problem in AMR. That is, it doesn't
>                         happen if I have nrefs=100000. Josh, do you know if
>                         that is also the case for you?
>
>                         Klaus, thanks for the explanation. I will look for
>                         Timers_start/stop pairs to see if that is the issue.
>
>                         Best,
>                         --------
>                         Ryan
>
>                         On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide
>                         <klaus at flash.uchicago.edu> wrote:
>
>                             On Tue, 11 Sep 2018, Joshua Wall wrote:
>
>                             > Hello FLASH users,
>                             >
>                             > I'm attempting to track down a strange occurrence in one of my runs
>                             > (using FLASH 4.2.2), where I get the following error related to
>                             > timing of the multigrid solver:
>                             >
>                             >     6043 perfmon: ran out of space for timer, "gr_hgBndry", cannot time this timer with perfmon
>                             >     6044 perfmon: ran out of space for timer, "work copy", cannot time this timer with perfmon
>                             >     6045 perfmon: ran out of space for timer, "gr_hgGuardCell", cannot time this timer with perfmon
>
>                             This could happen if Timers_start / Timers_stop call pairs
>                             are incomplete, or not properly nested.
>
>                             It is possible that the offending unbalanced code is not in
>                             the multigrid solver at all, but in an entirely different
>                             part of the code - the error could just happen to cause an
>                             overflow of the Timers stack there first.
>
>                             Klaus
>
>
>                     --             Joshua Wall
>                     Doctoral Candidate
>                     Department of Physics
>                     Drexel University
>                     3141 Chestnut Street
>                     Philadelphia, PA 19104
>
>
>
>         -- 
>         Joshua Wall
>         Doctoral Candidate
>         Department of Physics
>         Drexel University
>         3141 Chestnut Street
>         Philadelphia, PA 19104
>
>
>
