[FLASH-USERS] Timer runs out of space

Ryan Farber rjfarber at umich.edu
Tue Sep 11 17:39:20 EDT 2018


Hi Josh,

I solved the "perfmon" error by copying Timers_data.F90 to my problem
directory and setting tmr_maxTimerParents = 1000.

Explanation: I noticed that the number of strings used in call
Timers_start/stop seemed to be much greater than 30 (for me it is of
order 100).
The "perfmon" message is in Timers_start.F90 for the case that j < 0 where
j gets its value from tmr_stackListAdd().
tmr_stackListAdd is a subroutine in tmr_stackLib.F90 which passes -1 if the
referenced string == tmr_maxTimerParents.
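
To make the mechanism concrete, here is a toy model in Python. It is not FLASH source: the names stack_list_add, timers_start and MAX_TIMER_PARENTS are hypothetical stand-ins, the limit of 3 is arbitrary, and I am assuming the limit counts distinct call stacks. It only illustrates how a fixed-size table can return -1 and trigger a warning without aborting, and how a start with no matching stop keeps creating new entries.

```python
# Toy model only; not FLASH source. All names and the limit are hypothetical.
MAX_TIMER_PARENTS = 3  # stands in for tmr_maxTimerParents


def stack_list_add(stack_list, call_stack, max_parents=MAX_TIMER_PARENTS):
    """Return the index of call_stack in stack_list, appending it if new.

    Returns -1 when the call stack is new and the list is already full.
    """
    call_stack = tuple(call_stack)
    if call_stack in stack_list:
        return stack_list.index(call_stack)
    if len(stack_list) >= max_parents:
        return -1
    stack_list.append(call_stack)
    return len(stack_list) - 1


def timers_start(stack_list, call_stack, name):
    """Return a warning string if no slot is left, otherwise None."""
    j = stack_list_add(stack_list, call_stack)
    if j < 0:
        # Only a warning: the run continues, the timer is just not recorded.
        return ('perfmon: ran out of space for timer, "%s", '
                'cannot time this timer with perfmon' % name)
    return None


def run_unbalanced(steps, name="gr_hgBndry"):
    """Start the same timer repeatedly without ever stopping it."""
    stack_list, call_stack, warnings = [], [], []
    for _ in range(steps):
        msg = timers_start(stack_list, call_stack, name)
        if msg is not None:
            warnings.append(msg)
        call_stack.append(name)  # no matching stop, so the stack deepens
    return stack_list, warnings


if __name__ == "__main__":
    slots, warnings = run_unbalanced(5)
    print(len(slots), "slots used,", len(warnings), "warnings")
```

In this model every unmatched start produces a call stack that has not been seen before, so the table fills up and later starts only warn. Raising the limit hides the warning but does not fix the missing stop.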

However, the "perfmon" message doesn't actually throw an exception. I
forgot that I've had runs (probably the static mesh refinement ones too but
I can't check with Stampede2 down for maintenance today) that run okay even
with this perfmon error.

My actual error message (which I'll still have to debug) is:
DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
blockID = 0

I'm guessing your error is different.

Still, that solves one mystery.

Best,
--------
Ryan

On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall <joshua.e.wall at gmail.com>
wrote:

> Ryan,
>
>    Another clue that probably supports Klaus's theory that some timer was
> started but never stopped: I just found the following in my log file:
>
> [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing timer
> max/min/avg values because not all processors had same timers
>
> In case you find the same. To be clear on what I'm running with exactly:
>
> Hydro: USM
> Grav: MG
> +cube16
> AMR
> +pm4dev
> +supportPPMupwind
> maxblocks = 50
> -a -3d
> optimization: -O3
> MPI: OpenMPI 1.10.02
> compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)
>
>
> Klaus,
>
>    Would it be possible to call Timers_getSummary() after each call to the
> individual units in Driver_evolveFlash in an attempt to find the unit
> responsible by "bisection"? Essentially the same log message as above
> should print right after some unit gets the timers on the different
> processors out of sync. Does that seem okay to try?
>
> Cordially,
>
> Josh
>
> On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall <joshua.e.wall at gmail.com>
> wrote:
>
>> Hello Ryan and Klaus,
>>
>>     Ryan, yes I am running in AMR. Per Klaus's recommendation I just did:
>>
>> josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" * > timers_grep.txt
>>
>> Which will give me a list of calls to Timers_start and Timers_stop in one
>> text file. After I check this file to ensure that all have pairs, I'll have
>> to check each FLASH file for "if () then" and "#ifdef" statements that
>> might have isolated one of the timer calls. Feel free to also do this as a
>> check in case I miss anything. Whoever finds it first can post what we find
>> here (or if we don't find anything).
>>
>> Cordially,
>>
>> Josh
>>
>> On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber <rjfarber at umich.edu> wrote:
>>
>>> Hi Josh/Klaus,
>>>
>>> I'm also suffering this problem. It only appears for me when I run my
>>> problem in AMR. That is, it doesn't happen if I have nrefs=100000. Josh, do
>>> you know if that is also the case for you?
>>>
>>> Klaus, thanks for the explanation. I will look for Timers_start/stop
>>> pairs to see if that is the issue.
>>>
>>> Best,
>>> --------
>>> Ryan
>>>
>>> On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide <klaus at flash.uchicago.edu>
>>> wrote:
>>>
>>>> On Tue, 11 Sep 2018, Joshua Wall wrote:
>>>>
>>>> > Hello FLASH users,
>>>> >
>>>> > I'm attempting to track down a strange occurrence in one of my runs (using
>>>> > FLASH 4.2.2), where I get the following error related to timing of the
>>>> > multigrid solver:
>>>> >
>>>> >     6043  perfmon: ran out of space for timer, "gr_hgBndry", cannot time this timer with perfmon
>>>> >     6044  perfmon: ran out of space for timer, "work copy", cannot time this timer with perfmon
>>>> >     6045  perfmon: ran out of space for timer, "gr_hgGuardCell", cannot time this timer with perfmon
>>>>
>>>> This could happen if Timers_start / Timers_stop call pairs are incomplete,
>>>> or not properly nested.
>>>>
>>>> It is possible that the offending unbalanced code is not in the multigrid
>>>> solver at all, but in an entirely different part of the code - the error
>>>> could just happen to cause an overflow of the Timers stack there first.
>>>>
>>>> Klaus
>>>>
>>>
>>> --
>> Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> Philadelphia, PA 19104
>>
> --
> Joshua Wall
> Doctoral Candidate
> Department of Physics
> Drexel University
> 3141 Chestnut Street
> Philadelphia, PA 19104
>