[FLASH-USERS] Timer runs out of space

Joshua Wall joshua.e.wall at gmail.com
Wed Sep 12 09:27:13 EDT 2018


Ryan,

    Thanks for the info. Indeed my issue wasn't that runs were stopping,
but my output was being flooded with the timer error messages about running
out of space.

    I'm also fairly sure I'm not overproducing timers or replicating any
specific timer more than once... I made a couple of debugging statements in
Timers_start and Timers_stop that allows me to look for this. I'll attach a
copy for you in case you want to check for the same. Just grep your output
file like this:

cat output.txt | grep -iIn "timer" > timer_grep.txt

    and you should see all the timers it makes, including if something with
the same name gets made more than once (the name will repeat for different
indices if I'm reading how the timers are tracked correctly). In my case I
can see that the root process has 52 timers after the first loop completes
and 54 after the first output files are written, then it never grows after
this. I'm honestly not sure why I haven't seen this message before now if
the limit on timers has always been 30... it only appeared as I moved to
running on > 256 processors. Actually this test where I see 54 timers on
the root process still actually has tmr_maxTimerParents=30, and it shows no
error messages from the Timers module (its only running on 120 processors).

Josh


On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <rjfarber at umich.edu> wrote:

> Hi Josh,
>
> I solved the "perfmon" error by copying Timers_data.F90 to my problem
> directory and setting tmr_maxTimerParents = 1000.
>
> Explanation: I noticed that the strings in call Timers_start/stop seemed
> to be much greater than 30 (for me it is of order 100).
> The "perfmon" message is in Timers_start.F90 for the case that j < 0 where
> j gets its value from tmr_stackListAdd().
> tmr_stackListAdd is a subroutine in tmr_stackLib.F90 which passes -1 if
> the referenced string == tmr_maxTimerParents.
>
> However, the "perfmon" message doesn't actually throw an exception. I
> forgot that I've had runs (probably the static mesh refinement ones too but
> I can't check with Stampede2 down for maintenance today) that run okay even
> with this perfmon error.
>
> My actual error message (which I'll still have to debug) is:
> DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
> blockID = 0
>
> I'm guessing your error is different.
>
> Still, that solves one mystery.
>
> Best,
> --------
> Ryan
>
> On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall <joshua.e.wall at gmail.com>
> wrote:
>
>> Ryan,
>>
>>    Another clue I just found that probably supports Klaus's theory of
>> some timer started that didn't get stopped is the following I found in my
>> log file:
>>
>> [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing timer
>> max/min/avg values because not all processors had same timers
>>
>> In case you find the same. To be clear on what I'm running with exactly:
>>
>> Hydro: USM
>> Grav: MG
>> +cube16
>> AMR
>> +pm4dev
>> +supportPPMupwind
>> maxblocks = 50
>> -a -3d
>> optimization: -03
>> MPI: OpenMPI 1.10.02
>> compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)
>>
>>
>> Klaus,
>>
>>    Would it be possible to call Timer_getSummary() after each call to the
>> individual units in Driver_evolveFlash in an attempt to find the unit
>> responsible by "bisection"? Essentially the same log message as above
>> should print right after some unit gets the timers on the different
>> processors out of sync. Does that seem okay to try?
>>
>> Cordially,
>>
>> Josh
>>
>> On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall <joshua.e.wall at gmail.com>
>> wrote:
>>
>>> Hello Ryan and Klaus,
>>>
>>>     Ryan, yes I am running in AMR. Per Klaus's recommendation I just did:
>>>
>>> josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" *
>>> > timers_grep.txt
>>>
>>> Which will give me a list of calls to Timers_start and Timers_stop in
>>> one text file. After I check this file to ensure that all have pairs, I'll
>>> have to check each FLASH file for "if () then" and "#ifdef" statements that
>>> might have isolated one of the timer calls. Feel free to also do this as a
>>> check in case I miss anything. Whoever finds it first can post what we find
>>> here (or if we don't find anything).
>>>
>>> Cordially,
>>>
>>> Josh
>>>
>>> On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber <rjfarber at umich.edu> wrote:
>>>
>>>> Hi Josh/Klaus,
>>>>
>>>> I'm also suffering this problem. It only appears for me when I run my
>>>> problem in AMR. That is, it doesn't happen if I have nrefs=100000. Josh, do
>>>> you know if that is also the case for you?
>>>>
>>>> Klaus, thanks for the explanation I will look for Timers_start/stop
>>>> pairs to see if that is the issue.
>>>>
>>>> Best,
>>>> --------
>>>> Ryan
>>>>
>>>> On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide <klaus at flash.uchicago.edu>
>>>> wrote:
>>>>
>>>>> On Tue, 11 Sep 2018, Joshua Wall wrote:
>>>>>
>>>>> > Hello FLASH users,
>>>>> >
>>>>> > I'm attempting to track down a strange occurrence in one of my runs
>>>>> (using
>>>>> > FLASH 4.2.2), where I get the following error related to timing of
>>>>> the
>>>>> > multigrid solver:
>>>>> >
>>>>> >     6043  perfmon: ran out of space for timer, "gr_hgBndry", cannot
>>>>> time
>>>>> > this timer with perfmon
>>>>> >     6044  perfmon: ran out of space for timer, "work copy", cannot
>>>>> time
>>>>> > this timer with perfmon
>>>>> >     6045  perfmon: ran out of space for timer, "gr_hgGuardCell",
>>>>> cannot
>>>>> > time this timer with perfmon
>>>>>
>>>>> This could happen if Timers_start / Timers_stop call pairs are
>>>>> incomplete,
>>>>> or not properly nested.
>>>>>
>>>>> It is possible that the offending unbalanced code is not in the
>>>>> multigrid
>>>>> solver at all, but in an entirely different part of the code - the
>>>>> error
>>>>> could just happen to cause an overflow of the Timers stack there first.
>>>>>
>>>>> Klaus
>>>>>
>>>>
>>>> --
>>> Joshua Wall
>>> Doctoral Candidate
>>> Department of Physics
>>> Drexel University
>>> 3141 Chestnut Street
>>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>>> Philadelphia, PA 19104
>>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>>>
>> --
>> Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>> Philadelphia, PA 19104
>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>>
>
> --
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180912/c63fc734/attachment.htm>
-------------- next part --------------
50: Trying to start new timer initialization
51: Timer_startIndex returns            1
57: Trying to start new timer IO_readCheckpoint
58: Timer_startIndex returns            2
181: Trying to stop new timer IO_readCheckpoint
182: Timer_stopIndex returns            2
189: Trying to start new timer amr_refine_derefine
190: Timer_startIndex returns            3
196: Trying to stop new timer amr_refine_derefine
197: Timer_stopIndex returns            3
198: Trying to start new timer updateParticleRefinement
199: Timer_startIndex returns            4
200: Trying to stop new timer updateParticleRefinement
201: Timer_stopIndex returns            4
203: Trying to start new timer guardcell Barrier
204: Timer_startIndex returns            5
205: Trying to stop new timer guardcell Barrier
206: Timer_stopIndex returns            5
207: Trying to start new timer guardcell internal
208: Timer_startIndex returns            6
209: Trying to stop new timer guardcell internal
210: Timer_stopIndex returns            6
218: Trying to start new timer restrictAll
219: Timer_startIndex returns            7
220: Trying to stop new timer restrictAll
221: Timer_stopIndex returns            7
222: Trying to start new timer writePlotfile
223: Timer_startIndex returns            8
224: Trying to stop new timer writePlotfile
225: Timer_stopIndex returns            8
302: Trying to start new timer gravity Barrier
303: Timer_startIndex returns            9
304: Trying to stop new timer gravity Barrier
305: Timer_stopIndex returns            9
306: Trying to start new timer gravity
307: Timer_startIndex returns           10
308: Trying to start new timer Multigrid_solve
309: Timer_startIndex returns           11
310: Trying to start new timer amr_morton_process
311: Timer_startIndex returns           12
312: Trying to stop new timer amr_morton_process
313: Timer_stopIndex returns           12
314: Trying to start new timer gr_restrictTree
315: Timer_startIndex returns           13
316: Trying to stop new timer gr_restrictTree
317: Timer_stopIndex returns           13
318: Trying to start new timer gr_hgNorm
319: Timer_startIndex returns           14
320: Trying to stop new timer gr_hgNorm
321: Timer_stopIndex returns           14
322: Trying to start new timer gr_hgSetZeroBoundary
323: Timer_startIndex returns           15
324: Trying to stop new timer gr_hgSetZeroBoundary
325: Timer_stopIndex returns           15
326: Trying to start new timer  solvelevel 1                           
327: Timer_startIndex returns           16
328: Trying to start new timer fft
329: Timer_startIndex returns           17
330: Trying to stop new timer fft
331: Timer_stopIndex returns           17
332: Trying to start new timer gr_hgBndry
333: Timer_startIndex returns           18
334: Trying to start new timer work copy
335: Timer_startIndex returns           19
336: Trying to stop new timer work copy
337: Timer_stopIndex returns           19
338: Trying to start new timer gr_hgGuardCell
339: Timer_startIndex returns           20
340: Trying to stop new timer gr_hgGuardCell
341: Timer_stopIndex returns           20
342: Trying to start new timer gr_hgResidual
343: Timer_startIndex returns           21
344: Trying to stop new timer gr_hgResidual
345: Timer_stopIndex returns           21
346: Trying to start new timer gr_hgProlongBndries
347: Timer_startIndex returns           22
348: Trying to stop new timer gr_hgProlongBndries
349: Timer_stopIndex returns           22
350: Trying to start new timer  solvelevel 2                           
351: Timer_startIndex returns           23
352: Trying to stop new timer  solvelevel 2                           
353: Timer_stopIndex returns           23
354: Trying to start new timer  solvelevel 3                           
355: Timer_startIndex returns           24
356: Trying to stop new timer  solvelevel 3                           
357: Timer_stopIndex returns           24
358: Trying to start new timer  solvelevel 4                           
359: Timer_startIndex returns           25
360: Trying to stop new timer  solvelevel 4                           
361: Timer_stopIndex returns           25
362: Trying to start new timer  solvelevel 5                           
363: Timer_startIndex returns           26
364: Trying to stop new timer  solvelevel 5                           
365: Timer_stopIndex returns           26
366: Trying to start new timer  solvelevel 6                           
367: Timer_startIndex returns           27
368: Trying to stop new timer  solvelevel 6                           
369: Timer_stopIndex returns           27
370: Trying to start new timer  solvelevel 7                           
371: Timer_startIndex returns           28
372: Trying to stop new timer  solvelevel 7                           
373: Timer_stopIndex returns           28
374: Trying to start new timer gr_hgRestrict
375: Timer_startIndex returns           29
376: Trying to stop new timer gr_hgRestrict
377: Timer_stopIndex returns           29
378: Trying to start new timer Isobnd_Mpole_Solve
379: Timer_startIndex returns           30
380: Trying to stop new timer Isobnd_Mpole_Solve
381: Timer_stopIndex returns           30
382: Trying to start new timer no-data comm_setup
383: Timer_startIndex returns           31
384: Trying to stop new timer no-data comm_setup
385: Timer_stopIndex returns           31
386: Trying to start new timer sendrecv
387: Timer_startIndex returns           32
388: Trying to stop new timer sendrecv
389: Timer_stopIndex returns           32
404: Trying to start new timer evolution
405: Timer_startIndex returns           33
409: Trying to start new timer cosmology
410: Timer_startIndex returns           34
411: Trying to stop new timer cosmology
412: Timer_stopIndex returns           34
413: Trying to start new timer Hydro
414: Timer_startIndex returns           35
415: Trying to start new timer hydro_unsplit
416: Timer_startIndex returns           36
417: Trying to start new timer eos gc
418: Timer_startIndex returns           37
419: Trying to stop new timer eos gc
420: Timer_stopIndex returns           37
421: Trying to start new timer getRiemannState
422: Timer_startIndex returns           38
423: Trying to stop new timer getRiemannState
424: Timer_stopIndex returns           38
425: Trying to start new timer getFaceFlux
426: Timer_startIndex returns           39
427: Trying to stop new timer getFaceFlux
428: Timer_stopIndex returns           39
429: Trying to start new timer unsplitUpdate
430: Timer_startIndex returns           40
431: Trying to stop new timer unsplitUpdate
432: Timer_stopIndex returns           40
433: Trying to start new timer RadTrans
434: Timer_startIndex returns           41
435: Trying to start new timer raytracing
436: Timer_startIndex returns           42
437: Trying to stop new timer raytracing
438: Timer_stopIndex returns           42
439: Trying to start new timer heat
440: Timer_startIndex returns           43
441: Trying to stop new timer heat
442: Timer_stopIndex returns           43
449: Trying to start new timer sourceTerms
450: Timer_startIndex returns           44
451: Trying to stop new timer sourceTerms
452: Timer_stopIndex returns           44
453: Trying to start new timer Particles_advance
454: Timer_startIndex returns           45
501: Trying to stop new timer Particles_advance
502: Timer_stopIndex returns           45
503: Trying to start new timer Gravity potential
504: Timer_startIndex returns           46
507: Trying to stop new timer Gravity potential
508: Timer_stopIndex returns           46
509: Trying to start new timer Grid_updateRefinement
510: Timer_startIndex returns           47
511: Trying to start new timer tree
512: Timer_startIndex returns           48
513: Trying to start new timer markRefineDerefine
514: Timer_startIndex returns           49
515: Trying to stop new timer markRefineDerefine
516: Timer_stopIndex returns           49
522: Trying to start new timer updateData
523: Timer_startIndex returns           50
524: Trying to stop new timer updateData
525: Timer_stopIndex returns           50
526: Trying to start new timer eos
527: Timer_startIndex returns           51
528: Trying to stop new timer eos
529: Timer_stopIndex returns           51
544: Trying to start new timer diagnostics
545: Timer_startIndex returns           52
548: Trying to stop new timer diagnostics
549: Timer_stopIndex returns           52
2057: Trying to start new timer plotfile
2058: Timer_startIndex returns           53
2060: Trying to stop new timer plotfile
2061: Timer_stopIndex returns           53
2062: Trying to start new timer writeParticles
2063: Timer_startIndex returns           54
2064: Trying to stop new timer writeParticles
2065: Timer_stopIndex returns           54
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Timers_start.F90
Type: text/x-fortran
Size: 4020 bytes
Desc: not available
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180912/c63fc734/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Timers_stop.F90
Type: text/x-fortran
Size: 3380 bytes
Desc: not available
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180912/c63fc734/attachment-0001.bin>


More information about the flash-users mailing list