[FLASH-USERS] Timer runs out of space

Ryan Farber rjfarber at umich.edu
Wed Sep 12 16:30:37 EDT 2018


Tomek, I don't quite understand what you wrote. Won't all processes execute
the same code unless it is enclosed in a conditional that excludes some of
them (e.g., "if (my_rank == MASTER_PE) then .... endif")? Any such
conditional would be noticeable by looking at the source code between a
Timers_start and a Timers_stop.
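
If per-process tracking is wanted anyway, a minimal sketch of the comparison
could look like the following. This is only an illustration, not FLASH code:
the "rank=<n> timer=<name>" line format and the function names are
hypothetical, so the regex would need adapting to whatever the debugging
statements in Timers_start/Timers_stop actually print.

```python
import re
from collections import defaultdict

# Hypothetical debug-line format (an assumption; FLASH does not print this).
LINE_RE = re.compile(r"rank=(\d+)\s+timer=(.+?)\s*$")

def timers_by_rank(lines):
    """Map each rank to the set of timer names it reported."""
    seen = defaultdict(set)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            seen[int(m.group(1))].add(m.group(2))
    return dict(seen)

def uneven_timers(seen):
    """Return {timer name: sorted ranks missing it} for timers not on every rank."""
    all_names = set().union(*seen.values()) if seen else set()
    return {
        name: sorted(r for r, names in seen.items() if name not in names)
        for name in sorted(all_names)
        if any(name not in names for names in seen.values())
    }

sample = [
    "rank=0 timer=gr_hgBndry",
    "rank=0 timer=master only",
    "rank=1 timer=gr_hgBndry",
]
print(uneven_timers(timers_by_rank(sample)))
```

On the sample lines, the timer "master only" is reported as missing on rank 1,
which is the kind of mismatch a rank-dependent conditional would produce.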

Another thing to note is that grepping just for "call Timers" is not quite
general enough, because there are also "#IO_TIMERS_START/STOP(blah)" pairs
of statements.
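
A rough sketch of a scan that catches both forms and checks nesting is below.
It is an illustration under assumptions, not a FLASH tool: it assumes the
timer name is a quoted string literal, that the macro form is written as
IO_TIMERS_START("name"), and it reads the source top to bottom without
following control flow. The function names are made up for this example.

```python
import re

# Matches `call Timers_start("name")` and the assumed macro form
# `IO_TIMERS_START("name")`; Fortran is case-insensitive, hence IGNORECASE.
CALL_RE = re.compile(
    r"""(?:call\s+Timers_|IO_TIMERS_)(start|stop)\s*\(\s*["']([^"']*)["']""",
    re.IGNORECASE,
)

def timer_calls(source):
    """Yield (line number, 'start' or 'stop', timer name) in source order."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Crude comment stripping; breaks if a timer name contains "!".
        code = line.split("!", 1)[0]
        for kind, name in CALL_RE.findall(code):
            yield lineno, kind.lower(), name

def check_nesting(source):
    """List stops that do not close the innermost open timer, and unclosed starts."""
    problems, stack = [], []
    for lineno, kind, name in timer_calls(source):
        if kind == "start":
            stack.append((lineno, name))
        elif stack and stack[-1][1] == name:
            stack.pop()
        else:
            problems.append(
                f"line {lineno}: stop of {name!r} does not match innermost open timer"
            )
    problems += [f"line {n}: {name!r} started but never stopped" for n, name in stack]
    return problems

sample = '''
call Timers_start("evolution")
IO_TIMERS_START("write checkpoint")
call Timers_stop("evolution")   ! wrong order
'''
for problem in check_nesting(sample):
    print(problem)
```

Because the scan is linear, a single misordered stop also leaves the earlier
starts flagged as unclosed, so the first reported line is usually the one to
look at.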

However, I didn't notice anything wrong with those pairs in my case either.
I'm content with just setting tmr_maxTimerParents to a suitably large number
as a quick fix.

Best,
--------
Ryan

On Wed, Sep 12, 2018 at 9:45 AM, Tomasz Plewa <tplewa at fsu.edu> wrote:

> Apart from simple missing start/stop pairings, one or more timer
> start/stop calls might be embedded in a code section that is not
> executed by all participating processes. The latter cannot be identified by
> analyzing the source code and needs tracking of timer calls from individual
> processes.
>
> Tomek
> --
> On 09/12/18 09:27, Joshua Wall wrote:
>
>> Ryan,
>>
>>     Thanks for the info. Indeed my issue wasn't that runs were stopping,
>> but my output was being flooded with the timer error messages about running
>> out of space.
>>
>>     I'm also fairly sure I'm not overproducing timers or replicating any
>> specific timer more than once... I made a couple of debugging statements in
>> Timers_start and Timers_stop that allow me to look for this. I'll attach a
>> copy for you in case you want to check for the same. Just grep your output
>> file like this:
>>
>> cat output.txt | grep -iIn "timer" > timer_grep.txt
>>
>>     and you should see all the timers it makes, including if something
>> with the same name gets made more than once (the name will repeat for
>> different indices if I'm reading how the timers are tracked correctly). In
>> my case I can see that the root process has 52 timers after the first loop
>> completes and 54 after the first output files are written, then it never
>> grows after this. I'm honestly not sure why I haven't seen this message
>> before now if the limit on timers has always been 30... it only appeared as
>> I moved to running on > 256 processors. Actually this test where I see 54
>> timers on the root process still actually has tmr_maxTimerParents=30, and
>> it shows no error messages from the Timers module (it's only running on 120
>> processors).
>>
>> Josh
>>
>>
>> On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <rjfarber at umich.edu> wrote:
>>
>>     Hi Josh,
>>
>>     I solved the "perfmon" error by copying Timers_data.F90 to my
>>     problem directory and setting tmr_maxTimerParents = 1000.
>>
>>     Explanation: I noticed that the number of strings in call
>>     Timers_start/stop seemed to be much greater than 30 (for me it is
>>     of order 100). The "perfmon" message is printed in Timers_start.F90
>>     when j < 0, where j gets its value from tmr_stackListAdd().
>>     tmr_stackListAdd is a subroutine in tmr_stackLib.F90 which passes
>>     back -1 if the referenced string == tmr_maxTimerParents.
>>
>>     However, the "perfmon" message doesn't actually throw an
>>     exception. I forgot that I've had runs (probably the static mesh
>>     refinement ones too but I can't check with Stampede2 down for
>>     maintenance today) that run okay even with this perfmon error.
>>
>>     My actual error message (which I'll still have to debug) is:
>>     DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
>>     blockID = 0
>>
>>     I'm guessing your error is different.
>>
>>     Still, that solves one mystery.
>>
>>     Best,
>>     --------
>>     Ryan
>>
>>     On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall
>>     <joshua.e.wall at gmail.com> wrote:
>>
>>         Ryan,
>>
>>            Another clue I just found that probably supports Klaus's
>>         theory of some timer started that didn't get stopped is the
>>         following I found in my log file:
>>
>>         [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing
>>         timer max/min/avg values because not all processors had same
>>         timers
>>
>>         In case you find the same. To be clear on what I'm running
>>         with exactly:
>>
>>         Hydro: USM
>>         Grav: MG
>>         +cube16
>>         AMR
>>         +pm4dev
>>         +supportPPMupwind
>>         maxblocks = 50
>>         -a -3d
>>         optimization: -O3
>>         MPI: OpenMPI 1.10.02
>>         compiler: gnu 4.8.0 (with the -O0 fix for
>>         mpi_amr_1blk_guardcell.o)
>>
>>
>>         Klaus,
>>
>>            Would it be possible to call Timers_getSummary() after each
>>         call to the individual units in Driver_evolveFlash in an
>>         attempt to find the unit responsible by "bisection"?
>>         Essentially the same log message as above should print right
>>         after some unit gets the timers on the different processors
>>         out of sync. Does that seem okay to try?
>>
>>         Cordially,
>>
>>         Josh
>>
>>         On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall
>>         <joshua.e.wall at gmail.com> wrote:
>>
>>             Hello Ryan and Klaus,
>>
>>                 Ryan, yes I am running in AMR. Per Klaus's
>>             recommendation I just did:
>>
>>             josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" * > timers_grep.txt
>>
>>             Which will give me a list of calls to Timers_start and
>>             Timers_stop in one text file. After I check this file to
>>             ensure that all have pairs, I'll have to check each FLASH
>>             file for "if () then" and "#ifdef" statements that might
>>             have isolated one of the timer calls. Feel free to also do
>>             this as a check in case I miss anything. Whoever finds it
>>             first can post what we find here (or if we don't find
>>             anything).
>>
>>             Cordially,
>>
>>             Josh
>>
>>             On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber
>>             <rjfarber at umich.edu> wrote:
>>
>>                 Hi Josh/Klaus,
>>
>>                 I'm also suffering this problem. It only appears for
>>                 me when I run my problem in AMR. That is, it doesn't
>>                 happen if I have nrefs=100000. Josh, do you know if
>>                 that is also the case for you?
>>
>>                 Klaus, thanks for the explanation. I will look for
>>                 Timers_start/stop pairs to see if that is the issue.
>>
>>                 Best,
>>                 --------
>>                 Ryan
>>
>>                 On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide
>>                 <klaus at flash.uchicago.edu> wrote:
>>
>>                     On Tue, 11 Sep 2018, Joshua Wall wrote:
>>
>>                     > Hello FLASH users,
>>                     >
>>                     > I'm attempting to track down a strange occurrence
>>                     > in one of my runs (using FLASH 4.2.2), where I get
>>                     > the following error related to timing of the
>>                     > multigrid solver:
>>                     >
>>                     >     6043 perfmon: ran out of space for timer, "gr_hgBndry", cannot time this timer with perfmon
>>                     >     6044 perfmon: ran out of space for timer, "work copy", cannot time this timer with perfmon
>>                     >     6045 perfmon: ran out of space for timer, "gr_hgGuardCell", cannot time this timer with perfmon
>>
>>                     This could happen if Timers_start / Timers_stop
>>                     call pairs are incomplete, or not properly nested.
>>
>>                     It is possible that the offending unbalanced code
>>                     is not in the multigrid solver at all, but in an
>>                     entirely different part of the code - the error
>>                     could just happen to cause an overflow of the
>>                     Timers stack there first.
>>
>>                     Klaus
>>
>>
>>             --
>>             Joshua Wall
>>             Doctoral Candidate
>>             Department of Physics
>>             Drexel University
>>             3141 Chestnut Street
>>             Philadelphia, PA 19104
>>
>>         --
>>         Joshua Wall
>>         Doctoral Candidate
>>         Department of Physics
>>         Drexel University
>>         3141 Chestnut Street
>>         Philadelphia, PA 19104
>>
>>
>> --
>> Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> Philadelphia, PA 19104
>>
>
>