[FLASH-USERS] Timer runs out of space
Joshua Wall
joshua.e.wall at gmail.com
Wed Sep 12 09:27:13 EDT 2018
Ryan,
Thanks for the info. Indeed my issue wasn't that runs were stopping,
but my output was being flooded with the timer error messages about running
out of space.
I'm also fairly sure I'm not overproducing timers or replicating any
specific timer more than once... I made a couple of debugging statements in
Timers_start and Timers_stop that allows me to look for this. I'll attach a
copy for you in case you want to check for the same. Just grep your output
file like this:
cat output.txt | grep -iIn "timer" > timer_grep.txt
and you should see all the timers it makes, including if something with
the same name gets made more than once (the name will repeat for different
indices if I'm reading how the timers are tracked correctly). In my case I
can see that the root process has 52 timers after the first loop completes
and 54 after the first output files are written, then it never grows after
this. I'm honestly not sure why I haven't seen this message before now if
the limit on timers has always been 30... it only appeared as I moved to
running on > 256 processors. Actually this test where I see 54 timers on
the root process still actually has tmr_maxTimerParents=30, and it shows no
error messages from the Timers module (its only running on 120 processors).
Josh
On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <rjfarber at umich.edu> wrote:
> Hi Josh,
>
> I solved the "perfmon" error by copying Timers_data.F90 to my problem
> directory and setting tmr_maxTimerParents = 1000.
>
> Explanation: I noticed that the strings in call Timers_start/stop seemed
> to be much greater than 30 (for me it is of order 100).
> The "perfmon" message is in Timers_start.F90 for the case that j < 0 where
> j gets its value from tmr_stackListAdd().
> tmr_stackListAdd is a subroutine in tmr_stackLib.F90 which passes -1 if
> the referenced string == tmr_maxTimerParents.
>
> However, the "perfmon" message doesn't actually throw an exception. I
> forgot that I've had runs (probably the static mesh refinement ones too but
> I can't check with Stampede2 down for maintenance today) that run okay even
> with this perfmon error.
>
> My actual error message (which I'll still have to debug) is:
> DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds
> blockID = 0
>
> I'm guessing your error is different.
>
> Still, that solves one mystery.
>
> Best,
> --------
> Ryan
>
> On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall <joshua.e.wall at gmail.com>
> wrote:
>
>> Ryan,
>>
>> Another clue I just found that probably supports Klaus's theory of
>> some timer started that didn't get stopped is the following I found in my
>> log file:
>>
>> [ 09-10-2018 23:31:07.824 ] [Timers_getSummary]: Not writing timer
>> max/min/avg values because not all processors had same timers
>>
>> In case you find the same. To be clear on what I'm running with exactly:
>>
>> Hydro: USM
>> Grav: MG
>> +cube16
>> AMR
>> +pm4dev
>> +supportPPMupwind
>> maxblocks = 50
>> -a -3d
>> optimization: -03
>> MPI: OpenMPI 1.10.02
>> compiler: gnu 4.8.0 (with the -O0 fix for mpi_amr_1blk_guardcell.o)
>>
>>
>> Klaus,
>>
>> Would it be possible to call Timer_getSummary() after each call to the
>> individual units in Driver_evolveFlash in an attempt to find the unit
>> responsible by "bisection"? Essentially the same log message as above
>> should print right after some unit gets the timers on the different
>> processors out of sync. Does that seem okay to try?
>>
>> Cordially,
>>
>> Josh
>>
>> On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall <joshua.e.wall at gmail.com>
>> wrote:
>>
>>> Hello Ryan and Klaus,
>>>
>>> Ryan, yes I am running in AMR. Per Klaus's recommendation I just did:
>>>
>>> josh at iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" *
>>> > timers_grep.txt
>>>
>>> Which will give me a list of calls to Timers_start and Timers_stop in
>>> one text file. After I check this file to ensure that all have pairs, I'll
>>> have to check each FLASH file for "if () then" and "#ifdef" statements that
>>> might have isolated one of the timer calls. Feel free to also do this as a
>>> check in case I miss anything. Whoever finds it first can post what we find
>>> here (or if we don't find anything).
>>>
>>> Cordially,
>>>
>>> Josh
>>>
>>> On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber <rjfarber at umich.edu> wrote:
>>>
>>>> Hi Josh/Klaus,
>>>>
>>>> I'm also suffering this problem. It only appears for me when I run my
>>>> problem in AMR. That is, it doesn't happen if I have nrefs=100000. Josh, do
>>>> you know if that is also the case for you?
>>>>
>>>> Klaus, thanks for the explanation I will look for Timers_start/stop
>>>> pairs to see if that is the issue.
>>>>
>>>> Best,
>>>> --------
>>>> Ryan
>>>>
>>>> On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide <klaus at flash.uchicago.edu>
>>>> wrote:
>>>>
>>>>> On Tue, 11 Sep 2018, Joshua Wall wrote:
>>>>>
>>>>> > Hello FLASH users,
>>>>> >
>>>>> > I'm attempting to track down a strange occurrence in one of my runs
>>>>> (using
>>>>> > FLASH 4.2.2), where I get the following error related to timing of
>>>>> the
>>>>> > multigrid solver:
>>>>> >
>>>>> > 6043 perfmon: ran out of space for timer, "gr_hgBndry", cannot
>>>>> time
>>>>> > this timer with perfmon
>>>>> > 6044 perfmon: ran out of space for timer, "work copy", cannot
>>>>> time
>>>>> > this timer with perfmon
>>>>> > 6045 perfmon: ran out of space for timer, "gr_hgGuardCell",
>>>>> cannot
>>>>> > time this timer with perfmon
>>>>>
>>>>> This could happen if Timers_start / Timers_stop call pairs are
>>>>> incomplete,
>>>>> or not properly nested.
>>>>>
>>>>> It is possible that the offending unbalanced code is not in the
>>>>> multigrid
>>>>> solver at all, but in an entirely different part of the code - the
>>>>> error
>>>>> could just happen to cause an overflow of the Timers stack there first.
>>>>>
>>>>> Klaus
>>>>>
>>>>
>>>> --
>>> Joshua Wall
>>> Doctoral Candidate
>>> Department of Physics
>>> Drexel University
>>> 3141 Chestnut Street
>>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>>> Philadelphia, PA 19104
>>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>>>
>> --
>> Joshua Wall
>> Doctoral Candidate
>> Department of Physics
>> Drexel University
>> 3141 Chestnut Street
>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>> Philadelphia, PA 19104
>> <https://maps.google.com/?q=3141+Chestnut+Street+Philadelphia,+PA+19104&entry=gmail&source=g>
>>
>
> --
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180912/c63fc734/attachment-0001.htm>
-------------- next part --------------
50: Trying to start new timer initialization
51: Timer_startIndex returns 1
57: Trying to start new timer IO_readCheckpoint
58: Timer_startIndex returns 2
181: Trying to stop new timer IO_readCheckpoint
182: Timer_stopIndex returns 2
189: Trying to start new timer amr_refine_derefine
190: Timer_startIndex returns 3
196: Trying to stop new timer amr_refine_derefine
197: Timer_stopIndex returns 3
198: Trying to start new timer updateParticleRefinement
199: Timer_startIndex returns 4
200: Trying to stop new timer updateParticleRefinement
201: Timer_stopIndex returns 4
203: Trying to start new timer guardcell Barrier
204: Timer_startIndex returns 5
205: Trying to stop new timer guardcell Barrier
206: Timer_stopIndex returns 5
207: Trying to start new timer guardcell internal
208: Timer_startIndex returns 6
209: Trying to stop new timer guardcell internal
210: Timer_stopIndex returns 6
218: Trying to start new timer restrictAll
219: Timer_startIndex returns 7
220: Trying to stop new timer restrictAll
221: Timer_stopIndex returns 7
222: Trying to start new timer writePlotfile
223: Timer_startIndex returns 8
224: Trying to stop new timer writePlotfile
225: Timer_stopIndex returns 8
302: Trying to start new timer gravity Barrier
303: Timer_startIndex returns 9
304: Trying to stop new timer gravity Barrier
305: Timer_stopIndex returns 9
306: Trying to start new timer gravity
307: Timer_startIndex returns 10
308: Trying to start new timer Multigrid_solve
309: Timer_startIndex returns 11
310: Trying to start new timer amr_morton_process
311: Timer_startIndex returns 12
312: Trying to stop new timer amr_morton_process
313: Timer_stopIndex returns 12
314: Trying to start new timer gr_restrictTree
315: Timer_startIndex returns 13
316: Trying to stop new timer gr_restrictTree
317: Timer_stopIndex returns 13
318: Trying to start new timer gr_hgNorm
319: Timer_startIndex returns 14
320: Trying to stop new timer gr_hgNorm
321: Timer_stopIndex returns 14
322: Trying to start new timer gr_hgSetZeroBoundary
323: Timer_startIndex returns 15
324: Trying to stop new timer gr_hgSetZeroBoundary
325: Timer_stopIndex returns 15
326: Trying to start new timer solvelevel 1
327: Timer_startIndex returns 16
328: Trying to start new timer fft
329: Timer_startIndex returns 17
330: Trying to stop new timer fft
331: Timer_stopIndex returns 17
332: Trying to start new timer gr_hgBndry
333: Timer_startIndex returns 18
334: Trying to start new timer work copy
335: Timer_startIndex returns 19
336: Trying to stop new timer work copy
337: Timer_stopIndex returns 19
338: Trying to start new timer gr_hgGuardCell
339: Timer_startIndex returns 20
340: Trying to stop new timer gr_hgGuardCell
341: Timer_stopIndex returns 20
342: Trying to start new timer gr_hgResidual
343: Timer_startIndex returns 21
344: Trying to stop new timer gr_hgResidual
345: Timer_stopIndex returns 21
346: Trying to start new timer gr_hgProlongBndries
347: Timer_startIndex returns 22
348: Trying to stop new timer gr_hgProlongBndries
349: Timer_stopIndex returns 22
350: Trying to start new timer solvelevel 2
351: Timer_startIndex returns 23
352: Trying to stop new timer solvelevel 2
353: Timer_stopIndex returns 23
354: Trying to start new timer solvelevel 3
355: Timer_startIndex returns 24
356: Trying to stop new timer solvelevel 3
357: Timer_stopIndex returns 24
358: Trying to start new timer solvelevel 4
359: Timer_startIndex returns 25
360: Trying to stop new timer solvelevel 4
361: Timer_stopIndex returns 25
362: Trying to start new timer solvelevel 5
363: Timer_startIndex returns 26
364: Trying to stop new timer solvelevel 5
365: Timer_stopIndex returns 26
366: Trying to start new timer solvelevel 6
367: Timer_startIndex returns 27
368: Trying to stop new timer solvelevel 6
369: Timer_stopIndex returns 27
370: Trying to start new timer solvelevel 7
371: Timer_startIndex returns 28
372: Trying to stop new timer solvelevel 7
373: Timer_stopIndex returns 28
374: Trying to start new timer gr_hgRestrict
375: Timer_startIndex returns 29
376: Trying to stop new timer gr_hgRestrict
377: Timer_stopIndex returns 29
378: Trying to start new timer Isobnd_Mpole_Solve
379: Timer_startIndex returns 30
380: Trying to stop new timer Isobnd_Mpole_Solve
381: Timer_stopIndex returns 30
382: Trying to start new timer no-data comm_setup
383: Timer_startIndex returns 31
384: Trying to stop new timer no-data comm_setup
385: Timer_stopIndex returns 31
386: Trying to start new timer sendrecv
387: Timer_startIndex returns 32
388: Trying to stop new timer sendrecv
389: Timer_stopIndex returns 32
404: Trying to start new timer evolution
405: Timer_startIndex returns 33
409: Trying to start new timer cosmology
410: Timer_startIndex returns 34
411: Trying to stop new timer cosmology
412: Timer_stopIndex returns 34
413: Trying to start new timer Hydro
414: Timer_startIndex returns 35
415: Trying to start new timer hydro_unsplit
416: Timer_startIndex returns 36
417: Trying to start new timer eos gc
418: Timer_startIndex returns 37
419: Trying to stop new timer eos gc
420: Timer_stopIndex returns 37
421: Trying to start new timer getRiemannState
422: Timer_startIndex returns 38
423: Trying to stop new timer getRiemannState
424: Timer_stopIndex returns 38
425: Trying to start new timer getFaceFlux
426: Timer_startIndex returns 39
427: Trying to stop new timer getFaceFlux
428: Timer_stopIndex returns 39
429: Trying to start new timer unsplitUpdate
430: Timer_startIndex returns 40
431: Trying to stop new timer unsplitUpdate
432: Timer_stopIndex returns 40
433: Trying to start new timer RadTrans
434: Timer_startIndex returns 41
435: Trying to start new timer raytracing
436: Timer_startIndex returns 42
437: Trying to stop new timer raytracing
438: Timer_stopIndex returns 42
439: Trying to start new timer heat
440: Timer_startIndex returns 43
441: Trying to stop new timer heat
442: Timer_stopIndex returns 43
449: Trying to start new timer sourceTerms
450: Timer_startIndex returns 44
451: Trying to stop new timer sourceTerms
452: Timer_stopIndex returns 44
453: Trying to start new timer Particles_advance
454: Timer_startIndex returns 45
501: Trying to stop new timer Particles_advance
502: Timer_stopIndex returns 45
503: Trying to start new timer Gravity potential
504: Timer_startIndex returns 46
507: Trying to stop new timer Gravity potential
508: Timer_stopIndex returns 46
509: Trying to start new timer Grid_updateRefinement
510: Timer_startIndex returns 47
511: Trying to start new timer tree
512: Timer_startIndex returns 48
513: Trying to start new timer markRefineDerefine
514: Timer_startIndex returns 49
515: Trying to stop new timer markRefineDerefine
516: Timer_stopIndex returns 49
522: Trying to start new timer updateData
523: Timer_startIndex returns 50
524: Trying to stop new timer updateData
525: Timer_stopIndex returns 50
526: Trying to start new timer eos
527: Timer_startIndex returns 51
528: Trying to stop new timer eos
529: Timer_stopIndex returns 51
544: Trying to start new timer diagnostics
545: Timer_startIndex returns 52
548: Trying to stop new timer diagnostics
549: Timer_stopIndex returns 52
2057: Trying to start new timer plotfile
2058: Timer_startIndex returns 53
2060: Trying to stop new timer plotfile
2061: Timer_stopIndex returns 53
2062: Trying to start new timer writeParticles
2063: Timer_startIndex returns 54
2064: Trying to stop new timer writeParticles
2065: Timer_stopIndex returns 54
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Timers_start.F90
Type: text/x-fortran
Size: 4020 bytes
Desc: not available
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180912/c63fc734/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Timers_stop.F90
Type: text/x-fortran
Size: 3380 bytes
Desc: not available
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20180912/c63fc734/attachment-0003.bin>
More information about the flash-users
mailing list