[FLASH-USERS] Timer runs out of space

Tomasz Plewa tplewa at fsu.edu
Wed Sep 12 09:45:30 EDT 2018


Apart from a simple missing start/stop pairing, one or more timer 
start/stop calls might be embedded in a code section that is not 
executed by all participating processes. The latter cannot be 
identified by analyzing the source code alone; it requires tracking 
timer calls from individual processes.
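
As a rough illustration of that kind of per-process tracking, here is a 
minimal Python sketch. It assumes each process prints one line per timer 
call in a made-up format, "rank=<N> start|stop <timer name>"; that format 
and the function name analyze_trace are purely hypothetical and would 
have to match whatever debugging statements you add to Timers_start and 
Timers_stop. It reports nesting problems per rank and timers that were 
not started on every rank.

```python
import re
from collections import defaultdict

# Hypothetical trace format, one line per timer call, e.g. "rank=3 start gr_hgBndry".
TRACE_LINE = re.compile(r"rank=(\d+)\s+(start|stop)\s+(.+)")

def analyze_trace(lines):
    """Return (problems, partial).

    problems: rank -> list of nesting errors found on that rank.
    partial:  timer name -> sorted list of ranks that never started it.
    """
    stacks = defaultdict(list)    # rank -> stack of currently open timers
    seen = defaultdict(set)       # timer name -> ranks that started it
    problems = defaultdict(list)
    for line in lines:
        m = TRACE_LINE.search(line)
        if not m:
            continue              # ignore unrelated output lines
        rank, action, name = int(m.group(1)), m.group(2), m.group(3).strip()
        stack = stacks[rank]      # also registers the rank, even if it only stops timers
        if action == "start":
            stack.append(name)
            seen[name].add(rank)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems[rank].append(f"stop '{name}' without matching start")
    for rank, stack in stacks.items():
        for name in stack:
            problems[rank].append(f"'{name}' never stopped")
    all_ranks = set(stacks)
    partial = {name: sorted(all_ranks - ranks)
               for name, ranks in seen.items() if ranks != all_ranks}
    return dict(problems), partial

# Toy trace: only rank 0 enters the branch that times "refine".
trace = [
    "rank=0 start evolution",
    "rank=1 start evolution",
    "rank=0 start refine",
    "rank=0 stop refine",
    "rank=0 stop evolution",
    "rank=1 stop evolution",
]
problems, partial = analyze_trace(trace)
print(problems)
print(partial)
```

In the toy trace every start has a matching stop, so the source looks 
balanced, yet "refine" shows up as missing on rank 1. That is the case 
which cannot be found by reading the source.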

Tomek
--
On 09/12/18 09:27, Joshua Wall wrote:
> Ryan,
>
>     Thanks for the info. Indeed my issue wasn't that runs were 
> stopping, but my output was being flooded with the timer error 
> messages about running out of space.
>
>     I'm also fairly sure I'm not overproducing timers or replicating 
> any specific timer more than once... I added a couple of debugging 
> statements to Timers_start and Timers_stop that allow me to look for 
> this. I'll attach a copy for you in case you want to check for the 
> same. Just grep your output file like this:
>
> grep -iIn "timer" output.txt > timer_grep.txt
>
>     and you should see all the timers it makes, including if something 
> with the same name gets made more than once (the name will repeat for 
> different indices if I'm reading how the timers are tracked 
> correctly). In my case I can see that the root process has 52 timers 
> after the first loop completes and 54 after the first output files are 
> written, then it never grows after this. I'm honestly not sure why I 
> haven't seen this message before now if the limit on timers has always 
> been 30... it only appeared as I moved to running on > 256 processors. 
> Actually, this test where I see 54 timers on the root process still 
> has tmr_maxTimerParents=30, and it shows no error messages from the 
> Timers module (it's only running on 120 processors).
>
> Josh
>
>
> On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <rjfarber at umich.edu> wrote:
>
>     Hi Josh,
>
>     I solved the "perfmon" error by copying Timers_data.F90 to my
>     problem directory and setting tmr_maxTimerParents = 1000.
>
>     Explanation: I noticed that the number of strings in call
>     Timers_start/stop seemed to be much greater than 30 (for me it is
>     of order 100).
>     The "perfmon" message is in Timers_start.F90 for the case that j <
>     0 where j gets its value from tmr_stackListAdd().
>     tmr_stackListAdd is a subroutine in tmr_stackLib.F90 which passes
>     -1 if the referenced string == tmr_maxTimerParents.
>
>     However, the "perfmon" message doesn't actually throw an
>     exception. I forgot that I've had runs (probably the static mesh
>     refinement ones too but I can't check with Stampede2 down for
>     maintenance today) that run okay even with this perfmon error.
>
>     My actual error message (which I'll still have to debug) is:
>     DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
>     blockID = 0
>
>     I'm guessing your error is different.
>
>     Still, that solves one mystery.
>
>     Best,
>     --------
>     Ryan
>
>     On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall
>     <joshua.e.wall at gmail.com> wrote:
>
>         Ryan,
>
>            Another clue that probably supports Klaus's theory that
>         some timer was started but never stopped is the following
>         line I found in my log file:
>
>         [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing
>         timer max/min/avg values because not all processors had same
>         timers
>
>         In case you find the same. To be clear on what I'm running
>         with exactly:
>
>         Hydro: USM
>         Grav: MG
>         +cube16
>         AMR
>         +pm4dev
>         +supportPPMupwind
>         maxblocks = 50
>         -a -3d
>         optimization: -O3
>         MPI: OpenMPI 1.10.02
>         compiler: gnu 4.8.0 (with the -O0 fix for
>         mpi_amr_1blk_guardcell.o)
>
>
>         Klaus,
>
>            Would it be possible to call Timers_getSummary() after each
>         call to the individual units in Driver_evolveFlash in an
>         attempt to find the unit responsible by "bisection"?
>         Essentially the same log message as above should print right
>         after some unit gets the timers on the different processors
>         out of sync. Does that seem okay to try?
>
>         Cordially,
>
>         Josh
>
>         On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall
>         <joshua.e.wall at gmail.com> wrote:
>
>             Hello Ryan and Klaus,
>
>                 Ryan, yes I am running in AMR. Per Klaus's
>             recommendation I just did:
>
>             josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn
>             "call Timers" * > timers_grep.txt
>
>             Which will give me a list of calls to Timers_start and
>             Timers_stop in one text file. After I check this file to
>             ensure that all have pairs, I'll have to check each FLASH
>             file for "if () then" and "#ifdef" statements that might
>             have isolated one of the timer calls. Feel free to also do
>             this as a check in case I miss anything. Whoever finds it
>             first can post what we find here (or if we don't find
>             anything).
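
The pairing check described above can be partly automated. The following 
is only a sketch in Python: it counts Timers_start and Timers_stop calls 
per timer name in a piece of Fortran source and lists the names whose 
counts differ. The function name unbalanced_timers is made up for this 
illustration, and the script assumes the timer name is passed as a quoted 
string literal without a "!" in it.

```python
import re
from collections import Counter

# Matches e.g.  call Timers_start("gr_hgBndry")  with either quote style.
CALL = re.compile(r"call\s+Timers_(start|stop)\s*\(\s*(['\"])(.*?)\2",
                  re.IGNORECASE)

def unbalanced_timers(source_text):
    """Return {timer name: (start count, stop count)} where the counts differ."""
    starts, stops = Counter(), Counter()
    for line in source_text.splitlines():
        code = line.split("!", 1)[0]   # crude removal of Fortran comments
        for kind, _quote, name in CALL.findall(code):
            if kind.lower() == "start":
                starts[name] += 1
            else:
                stops[name] += 1
    names = sorted(set(starts) | set(stops))
    return {n: (starts[n], stops[n]) for n in names if starts[n] != stops[n]}

# Toy source: the stop for "work copy" is commented out.
sample = '''
      call Timers_start("hydro")
      if (cond) then
         call Timers_start("work copy")
      end if
      call Timers_stop("hydro")   ! call Timers_stop("work copy")
'''
print(unbalanced_timers(sample))
```

Equal counts do not prove correct pairing: a start may have stops on 
several branches, and a pair inside an "if () then" or "#ifdef" block can 
still be skipped at run time, so the output is only a list of candidates 
to inspect by hand.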
>
>             Cordially,
>
>             Josh
>
>             On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber
>             <rjfarber at umich.edu> wrote:
>
>                 Hi Josh/Klaus,
>
>                 I'm also suffering this problem. It only appears for
>                 me when I run my problem in AMR. That is, it doesn't
>                 happen if I have nrefs=100000. Josh, do you know if
>                 that is also the case for you?
>
>                 Klaus, thanks for the explanation. I will look for
>                 Timers_start/stop pairs to see if that is the issue.
>
>                 Best,
>                 --------
>                 Ryan
>
>                 On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide
>                 <klaus at flash.uchicago.edu> wrote:
>
>                     On Tue, 11 Sep 2018, Joshua Wall wrote:
>
>                     > Hello FLASH users,
>                     >
>                     > I'm attempting to track down a strange occurrence in
>                     > one of my runs (using FLASH 4.2.2), where I get the
>                     > following error related to timing of the multigrid
>                     > solver:
>                     >
>                     >     6043 perfmon: ran out of space for timer, "gr_hgBndry", cannot time this timer with perfmon
>                     >     6044 perfmon: ran out of space for timer, "work copy", cannot time this timer with perfmon
>                     >     6045 perfmon: ran out of space for timer, "gr_hgGuardCell", cannot time this timer with perfmon
>
>                     This could happen if Timers_start / Timers_stop call
>                     pairs are incomplete, or not properly nested.
>
>                     It is possible that the offending unbalanced code is
>                     not in the multigrid solver at all, but in an entirely
>                     different part of the code - the error could just
>                     happen to cause an overflow of the Timers stack there
>                     first.
>
>                     Klaus
>
>
>             -- 
>             Joshua Wall
>             Doctoral Candidate
>             Department of Physics
>             Drexel University
>             3141 Chestnut Street
>             Philadelphia, PA 19104
>
>         -- 
>         Joshua Wall
>         Doctoral Candidate
>         Department of Physics
>         Drexel University
>         3141 Chestnut Street
>         Philadelphia, PA 19104
>
>
> -- 
> Joshua Wall
> Doctoral Candidate
> Department of Physics
> Drexel University
> 3141 Chestnut Street
> Philadelphia, PA 19104
