[FLASH-USERS] Timer runs out of space
Ryan Farber
rjfarber at umich.edu
Wed Sep 12 16:30:37 EDT 2018
Tomek, I don't quite understand what you wrote. Won't all processes execute
the same code unless it is wrapped in some conditional that excludes them
(e.g., "if (my_rank == MASTER_PE) then .... endif")? Any such conditional
would be noticeable by looking at the source code between a Timers_start and
a Timers_stop.
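To illustrate the kind of source inspection I mean, here is a minimal Python sketch (not part of FLASH; the function name guarded_timer_calls and the regular expressions are my own assumptions). It flags Timers_start/Timers_stop calls that sit inside an "if (...) then ... end if" block. It is only a heuristic: it ignores one-line ifs, continuation lines, "#ifdef" guards, and timer names that are not string literals.

```python
import re

# Heuristic patterns for free-form Fortran; these are assumptions, not a real parser.
IF_OPEN = re.compile(r'^\s*(?:\w+\s*:\s*)?if\s*\(.*\)\s*then\b', re.IGNORECASE)
IF_CLOSE = re.compile(r'^\s*end\s*if\b', re.IGNORECASE)
TIMER = re.compile(
    r'call\s+Timers_(start|stop)\s*\(\s*["\']([^"\']+)["\']', re.IGNORECASE
)


def guarded_timer_calls(source_text):
    """Return (line_number, kind, name, if_depth) for timer calls inside an if-block."""
    depth = 0
    found = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        # Drop trailing Fortran comments (naive: ignores '!' inside strings).
        code = line.split("!", 1)[0]
        if IF_CLOSE.match(code):
            depth = max(depth - 1, 0)
        m = TIMER.search(code)
        if m and depth > 0:
            found.append((lineno, m.group(1).lower(), m.group(2), depth))
        if IF_OPEN.match(code):
            depth += 1
    return found
```

Any call it reports is a candidate for a timer that only some processes start or stop, and would still need to be checked by hand.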
Another thing to note is that grepping just for "call Timers" is not quite
general enough because there are also "#IO_TIMERS_START/STOP(blah)" pairs
of statements as well.
However, I didn't notice anything wrong with those pairs in my case either.
I'm content with just setting tmr_maxTimerParents to a suitably large number
as a quick fix.
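For anyone who wants to repeat the pairing check with both call forms included, here is a minimal Python sketch (again not part of FLASH; the function name timer_balance is hypothetical, and the macro spelling IO_TIMERS_START/IO_TIMERS_STOP is taken from my description above, so adjust it to whatever your source actually uses). It counts starts minus stops per timer name and reports only the names that do not balance. It only sees timer names given as string literals.

```python
import re
from collections import Counter

# Matches "call Timers_start/stop("name")" and "IO_TIMERS_START/STOP("name")".
# The macro spelling is an assumption based on the text of this message.
TIMER_CALL = re.compile(
    r'(?:call\s+Timers_(start|stop)|IO_TIMERS_(START|STOP))'
    r'\s*\(\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def timer_balance(source_text):
    """Return {timer_name: starts - stops} for names whose counts differ."""
    balance = Counter()
    for line in source_text.splitlines():
        # Drop trailing Fortran comments (naive: ignores '!' inside strings).
        code = line.split("!", 1)[0]
        for m in TIMER_CALL.finditer(code):
            kind = (m.group(1) or m.group(2)).lower()
            balance[m.group(3)] += 1 if kind == "start" else -1
    return {name: n for name, n in balance.items() if n != 0}
```

Running it over the concatenated text of the files in the object directory gives an empty dict when every literal-named timer is balanced. A balanced count does not prove correct nesting, so this is a first filter only.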
Best,
--------
Ryan
On Wed, Sep 12, 2018 at 9:45 AM, Tomasz Plewa <tplewa at fsu.edu> wrote:
> Apart from simple missing start/stop pairing issues, one or more timer
> start/stop calls might be embedded in a code section which is not
> executed by all participating processes. The latter cannot be identified by
> analyzing the source code and needs tracking of timer calls from individual
> processes.
>
> Tomek
> --
> On 09/12/18 09:27, Joshua Wall wrote:
>
>> Ryan,
>>
>> Thanks for the info. Indeed my issue wasn't that runs were stopping,
>> but my output was being flooded with the timer error messages about running
>> out of space.
>>
>> I'm also fairly sure I'm not overproducing timers or replicating any
>> specific timer more than once... I added a couple of debugging statements to
>> Timers_start and Timers_stop that allow me to look for this. I'll attach a
>> copy for you in case you want to check for the same. Just grep your output
>> file like this:
>>
>> cat output.txt | grep -iIn "timer" > timer_grep.txt
>>
>> and you should see all the timers it makes, including if something
>> with the same name gets made more than once (the name will repeat for
>> different indices if I'm reading how the timers are tracked correctly). In
>> my case I can see that the root process has 52 timers after the first loop
>> completes and 54 after the first output files are written, then it never
>> grows after this. I'm honestly not sure why I haven't seen this message
>> before now if the limit on timers has always been 30... it only appeared as
>> I moved to running on > 256 processors. Actually this test where I see 54
>> timers on the root process still actually has tmr_maxTimerParents=30, and
>> it shows no error messages from the Timers module (it's only running on 120
>> processors).
>>
>> Josh
>>
>>
>> On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <rjfarber at umich.edu> wrote:
>>
>> Hi Josh,
>>
>> I solved the "perfmon" error by copying Timers_data.F90 to my
>> problem directory and setting tmr_maxTimerParents = 1000.
>>
>> Explanation: I noticed that the number of distinct strings passed to
>> Timers_start/stop calls seemed to be much greater than 30 (for me it is
>> of order 100). The "perfmon" message is in Timers_start.F90 for the case
>> that j < 0, where j gets its value from tmr_stackListAdd().
>> tmr_stackListAdd is a subroutine in tmr_stackLib.F90 which passes back
>> -1 once the list it is adding to has reached tmr_maxTimerParents entries.
>>
>> However, the "perfmon" message doesn't actually throw an
>> exception. I forgot that I've had runs (probably the static mesh
>> refinement ones too but I can't check with Stampede2 down for
>> maintenance today) that run okay even with this perfmon error.
>>
>> My actual error message (which I'll still have to debug) is:
>> DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
>> blockID = 0
>>
>> I'm guessing your error is different.
>>
>> Still, that solves one mystery.
>>
>> Best,
>> --------
>> Ryan
>>
>> On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall <joshua.e.wall at gmail.com> wrote:
>>
>> Ryan,
>>
>> Another clue I just found that probably supports Klaus's
>> theory of some timer started that didn't get stopped is the
>> following I found in my log file:
>>
>> [ 09-10-2018 23:31:07.824 ] [Timers_getSummary]: Not writing timer max/min/avg values because not all processors had same timers
>>
>> In case you find the same. To be clear on what I'm running
>> with exactly:
>>
>> Hydro: USM
>> Grav: MG
>> +cube16
>> AMR
>> +pm4dev
>> +supportPPMupwind
>> maxblocks = 50
>> -a -3d
>> optimization: -O3
>> MPI: OpenMPI 1.10.02
>> compiler: gnu 4.8.0 (with the -O0 fix for
>> mpi_amr_1blk_guardcell.o)
>>
>>
>> Klaus,
>>
>> Would it be possible to call Timers_getSummary() after each
>> call to the individual units in Driver_evolveFlash in an
>> attempt to find the unit responsible by "bisection"?
>> Essentially the same log message as above should print right
>> after some unit gets the timers on the different processors
>> out of sync. Does that seem okay to try?
>>
>> Cordially,
>>
>> Josh
>>
>> On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall <joshua.e.wall at gmail.com> wrote:
>>
>> Hello Ryan and Klaus,
>>
>> Ryan, yes I am running in AMR. Per Klaus's
>> recommendation I just did:
>>
>> josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" * > timers_grep.txt
>>
>> Which will give me a list of calls to Timers_start and
>> Timers_stop in one text file. After I check this file to
>> ensure that all have pairs, I'll have to check each FLASH
>> file for "if () then" and "#ifdef" statements that might
>> have isolated one of the timer calls. Feel free to also do
>> this as a check in case I miss anything. Whoever finds it
>> first can post what we find here (or if we don't find
>> anything).
>>
>> Cordially,
>>
>> Josh
>>
>> On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber <rjfarber at umich.edu> wrote:
>>
>> Hi Josh/Klaus,
>>
>> I'm also suffering this problem. It only appears for
>> me when I run my problem in AMR. That is, it doesn't
>> happen if I have nrefs=100000. Josh, do you know if
>> that is also the case for you?
>>
>> Klaus, thanks for the explanation I will look for
>> Timers_start/stop pairs to see if that is the issue.
>>
>> Best,
>> --------
>> Ryan
>>
>> On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide <klaus at flash.uchicago.edu> wrote:
>>
>> On Tue, 11 Sep 2018, Joshua Wall wrote:
>>
>> > Hello FLASH users,
>> >
>> > I'm attempting to track down a strange occurrence in one of my runs (using
>> > FLASH 4.2.2), where I get the following error related to timing of the
>> > multigrid solver:
>> >
>> > 6043 perfmon: ran out of space for timer, "gr_hgBndry", cannot time this timer with perfmon
>> > 6044 perfmon: ran out of space for timer, "work copy", cannot time this timer with perfmon
>> > 6045 perfmon: ran out of space for timer, "gr_hgGuardCell", cannot time this timer with perfmon
>>
>> This could happen if Timers_start / Timers_stop call pairs are incomplete,
>> or not properly nested.
>>
>> It is possible that the offending unbalanced code is not in the multigrid
>> solver at all, but in an entirely different part of the code - the error
>> could just happen to cause an overflow of the Timers stack there first.
>>
>> Klaus
>>
>>
>> -- Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> Philadelphia, PA 19104
>>
>>
>> --
>> Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> Philadelphia, PA 19104
>>
>
>