<div dir="ltr"><div>Tomek, I don't quite understand what you wrote. Won't all processes execute the same code unless it is wrapped in a conditional that excludes some of them (e.g., "if (my_rank == MASTER_PE) then .... endif")? Any such conditional would be noticeable by looking at the source code between a Timers_start and a Timers_stop.<br></div><div><br></div><div>Another thing to note is that grepping just for "call Timers" is not quite general enough, because there are also "#IO_TIMERS_START/STOP(blah)" pairs of statements.</div><div><br></div><div>However, I didn't notice anything wrong with those pairs in my case either. I'm content with just setting tmr_maxTimerParents to a suitably large number as a quick fix.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Best,<br clear="all"></div><div class="gmail_extra"><div class="gmail_signature">--------<div>Ryan</div></div></div>
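<div class="gmail_extra"><br></div><div class="gmail_extra">P.S. To make the kind of conditional I mean concrete, here is a minimal sketch (illustrative only, not actual FLASH source; the subroutine name and the timer label are made up):<br>
<br>
! A Timers_start/Timers_stop pair that is balanced in the source, so a grep<br>
! finds both calls, but only MASTER_PE ever executes it. The other processes<br>
! never see this timer, and Timers_getSummary reports that not all<br>
! processors had the same timers.<br>
subroutine demo_masterOnlyWork(my_rank)<br>
  use Timers_interface, ONLY : Timers_start, Timers_stop<br>
  implicit none<br>
  integer, intent(in) :: my_rank<br>
  integer, parameter  :: MASTER_PE = 0   ! stands in for MASTER_PE from "constants.h"<br>
<br>
  if (my_rank == MASTER_PE) then<br>
     call Timers_start("masterOnlyWork")<br>
     ! ... work done only on the master process ...<br>
     call Timers_stop("masterOnlyWork")<br>
  endif<br>
end subroutine demo_masterOnlyWork<br>
</div>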
<br><div class="gmail_quote">On Wed, Sep 12, 2018 at 9:45 AM, Tomasz Plewa <span dir="ltr"><<a href="mailto:tplewa@fsu.edu" target="_blank">tplewa@fsu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Apart from simple missing start/stop pairing issues, one or more timer start/stop calls might be embedded in a code section that is not executed by all participating processes. The latter case cannot be identified by analyzing the source code; it requires tracking the timer calls made by individual processes.<br>
<br>
Tomek<br>
--<span class=""><br>
On 09/12/18 09:27, Joshua Wall wrote:<br>
</span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
Ryan,<br>
<br>
    Thanks for the info. Indeed, my issue wasn't that runs were stopping, but that my output was being flooded with timer error messages about running out of space.<br>
<br>
    I'm also fairly sure I'm not overproducing timers or replicating any specific timer more than once... I added a couple of debugging statements in Timers_start and Timers_stop that allow me to check for this. I'll attach a copy for you in case you want to check for the same. Just grep your output file like this:<br>
<br>
cat output.txt | grep -iIn "timer" > timer_grep.txt<br>
<br>
    and you should see all the timers it makes, including whether something with the same name gets created more than once (the name will repeat for different indices, if I'm reading how the timers are tracked correctly). In my case I can see that the root process has 52 timers after the first loop completes and 54 after the first output files are written, and the count never grows after that. I'm honestly not sure why I haven't seen this message before now if the limit on timers has always been 30... it only appeared as I moved to running on > 256 processors. Actually, this test where I see 54 timers on the root process still has tmr_maxTimerParents=30, and it shows no error messages from the Timers module (it's only running on 120 processors).<br>
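<br>
    Since the attachment may not survive the list, here is roughly the shape of<br>
    those debugging statements (a sketch only: the rank variable is assumed to<br>
    be tmr_globalMe from Timers_data, which may be named differently in your<br>
    copy of FLASH):<br>
<br>
    ! Called as the first executable statement of Timers_start and Timers_stop,<br>
    ! with action = 'start' or 'stop'.  The 'timer' prefix is what the grep<br>
    ! above keys on.<br>
    subroutine tmr_debugPrint(action, name)<br>
      use Timers_data, ONLY : tmr_globalMe<br>
      implicit none<br>
      character(len=*), intent(in) :: action, name<br>
      write(*,*) 'timer: rank', tmr_globalMe, ' ', action, ' "', trim(name), '"'<br>
    end subroutine tmr_debugPrint<br>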
<br>
Josh<br>
<br>
<br></span><span class="">
On Tue, Sep 11, 2018, 5:39 PM Ryan Farber <<a href="mailto:rjfarber@umich.edu" target="_blank">rjfarber@umich.edu</a>> wrote:<br>
<br>
    Hi Josh,<br>
<br>
    I solved the "perfmon" error by copying Timers_data.F90 to my<br>
    problem directory and setting tmr_maxTimerParents = 1000.<br>
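<br>
    For reference, the edit itself is a one-liner in the copied file. A sketch<br>
    (the exact declaration in your Timers_data.F90 may differ; if it is a plain<br>
    module variable rather than a parameter, just change its initial value):<br>
<br>
    ! In the copy of Timers_data.F90 placed in the problem directory:<br>
    integer, parameter :: tmr_maxTimerParents = 1000   ! raised from the default of 30<br>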
<br>
    Explanation: I noticed that the number of distinct strings passed to<br>
    Timers_start/Timers_stop calls seemed to be much greater than 30 (for me<br>
    it is of order 100). The "perfmon" message is printed in Timers_start.F90<br>
    when j < 0, where j gets its value from tmr_stackListAdd().<br>
    tmr_stackListAdd is a subroutine in tmr_stackLib.F90 that returns -1<br>
    when the referenced list already holds tmr_maxTimerParents entries.<br>
<br>
    However, the "perfmon" message doesn't actually abort the run. I forgot<br>
    that I've had runs (probably the static mesh refinement ones too, but I<br>
    can't check with Stampede2 down for maintenance today) that finish okay<br>
    even with this perfmon error.<br>
<br>
    My actual error message (which I'll still have to debug) is:<br>
    DRIVER_ABORT: Error: Grid_getBlkBoundBox, blockId out of bounds<br>
    blockID = 0<br>
<br>
    I'm guessing your error is different.<br>
<br>
    Still, that solves one mystery.<br>
<br>
    Best,<br>
    --------<br>
    Ryan<br>
<br>
    On Tue, Sep 11, 2018 at 4:34 PM, Joshua Wall<br></span><div><div class="h5">
    <<a href="mailto:joshua.e.wall@gmail.com" target="_blank">joshua.e.wall@gmail.com</a>> wrote:<br>
<br>
        Ryan,<br>
<br>
           Another clue I just found, which probably supports Klaus's<br>
        theory that some timer was started but never stopped, is the<br>
        following entry in my log file:<br>
<br>
        [ 09-10-2018  23:31:07.824 ] [Timers_getSummary]: Not writing<br>
        timer max/min/avg values because not all processors had same<br>
        timers<br>
<br>
        I mention it in case you find the same. To be clear, here is<br>
        exactly what I'm running with:<br>
<br>
        Hydro: USM<br>
        Grav: MG<br>
        +cube16<br>
        AMR<br>
        +pm4dev<br>
        +supportPPMupwind<br>
        maxblocks = 50<br>
        -a -3d<br>
        optimization: -O3<br>
        MPI: OpenMPI 1.10.02<br>
        compiler: gnu 4.8.0 (with the -O0 fix for<br>
        mpi_amr_1blk_guardcell.o)<br>
<br>
<br>
        Klaus,<br>
<br>
           Would it be possible to call Timers_getSummary() after each<br>
        call to the individual units in Driver_evolveFlash, in an<br>
        attempt to find the unit responsible by "bisection"? Essentially,<br>
        the same log message as above should print right after whichever<br>
        unit gets the timers on the different processors out of sync.<br>
        Does that seem okay to try?<br>
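<br>
        A sketch of what I mean (only a sketch: the step-counter variable and<br>
        the integer argument to Timers_getSummary are assumptions on my part,<br>
        so please check Timers_interface.F90 before pasting anything in):<br>
<br>
        ! In Driver_evolveFlash.F90, right after each unit's existing<br>
        ! Timers_stop call (hydro, gravity, particles, IO, ...), add a probe:<br>
        call Timers_getSummary( dr_nstep )   ! dr_nstep: whichever step counter is in scope<br>
        ! The first probe whose log entry says "not all processors had same<br>
        ! timers" brackets the unit that got the timers out of sync.<br>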
<br>
        Cordially,<br>
<br>
        Josh<br>
<br>
        On Tue, Sep 11, 2018 at 3:03 PM Joshua Wall<br></div></div><span class="">
        <<a href="mailto:joshua.e.wall@gmail.com" target="_blank">joshua.e.wall@gmail.com</a>> wrote:<br>
<br>
            Hello Ryan and Klaus,<br>
<br>
                Ryan, yes I am running in AMR. Per Klaus's<br>
            recommendation I just did:<br>
<br>
            josh@iris2:~/flash/src/flash4.2.2-rad/object$ grep -iIn "call Timers" * > timers_grep.txt<br>
<br>
            This gives me a list of the calls to Timers_start and<br>
            Timers_stop in one text file. After I check this file to<br>
            ensure that all the calls have pairs, I'll have to check each<br>
            FLASH file for "if () then" and "#ifdef" statements that might<br>
            have isolated one of the timer calls. Feel free to also do<br>
            this as a check in case I miss anything. Whoever finds something<br>
            first can post it here (or report if we don't find anything).<br>
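<br>
            For example, this is the kind of pattern the grep alone would not<br>
            flag as a problem (hypothetical code: the timer label and the<br>
            DEBUG_GRID define are made up for illustration):<br>
<br>
            ! Both calls appear in timers_grep.txt, but in a build without<br>
            ! DEBUG_GRID only the start is compiled in, so the pair is<br>
            ! unbalanced on every process at run time.<br>
            call Timers_start("guardcell debug")<br>
            #ifdef DEBUG_GRID<br>
               ! ... extra consistency checks ...<br>
               call Timers_stop("guardcell debug")<br>
            #endif<br>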
<br>
            Cordially,<br>
<br>
            Josh<br>
<br>
            On Tue, Sep 11, 2018 at 2:54 PM Ryan Farber<br></span><span class="">
            <<a href="mailto:rjfarber@umich.edu" target="_blank">rjfarber@umich.edu</a>> wrote:<br>
<br>
                Hi Josh/Klaus,<br>
<br>
                I'm also suffering from this problem. It only appears for<br>
                me when I run my problem in AMR; that is, it doesn't<br>
                happen if I have nrefs=100000. Josh, do you know if<br>
                that is also the case for you?<br>
<br>
                Klaus, thanks for the explanation; I will look for<br>
                Timers_start/stop pairs to see if that is the issue.<br>
<br>
                Best,<br>
                --------<br>
                Ryan<br>
<br>
                On Tue, Sep 11, 2018 at 2:44 PM, Klaus Weide<br>
                <<a href="mailto:klaus@flash.uchicago.edu" target="_blank">klaus@flash.uchicago.edu</a>> wrote:<br></span><div><div class="h5">
<br>
                    On Tue, 11 Sep 2018, Joshua Wall wrote:<br>
<br>
                    > Hello FLASH users,<br>
                    ><br>
                    > I'm attempting to track down a strange occurrence in one<br>
                    > of my runs (using FLASH 4.2.2), where I get the following<br>
                    > error related to timing of the multigrid solver:<br>
                    ><br>
                    >     6043 perfmon: ran out of space for timer, "gr_hgBndry", cannot time this timer with perfmon<br>
                    >     6044 perfmon: ran out of space for timer, "work copy", cannot time this timer with perfmon<br>
                    >     6045 perfmon: ran out of space for timer, "gr_hgGuardCell", cannot time this timer with perfmon<br>
<br>
                    This could happen if Timers_start / Timers_stop<br>
                    call pairs are incomplete,<br>
                    or not properly nested.<br>
<br>
                    It is possible that the offending unbalanced code<br>
                    is not in the multigrid<br>
                    solver at all, but in an entirely different part<br>
                    of the code - the error<br>
                    could just happen to cause an overflow of the<br>
                    Timers stack there first.<br>
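<br>
                    To make "incomplete" and "not properly nested" concrete,<br>
                    two sketches (made-up timer labels and conditions, not<br>
                    actual FLASH source):<br>
<br>
                    ! Incomplete pair: started but never stopped on the early<br>
                    ! return path, so the timer stays on this process's stack.<br>
                    call Timers_start("sourceTerms")<br>
                    if (someErrorCondition) return   ! someErrorCondition: placeholder<br>
                    call Timers_stop("sourceTerms")<br>
<br>
                    ! Improper nesting: the outer timer is stopped while the<br>
                    ! inner one is still running, so starts and stops no<br>
                    ! longer match up on the stack.<br>
                    call Timers_start("outer")<br>
                    call Timers_start("inner")<br>
                    call Timers_stop("outer")<br>
                    call Timers_stop("inner")<br>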
<br>
                    Klaus<br>
<br>
<br>
            -- <br>
            Joshua Wall<br>
            Doctoral Candidate<br>
            Department of Physics<br>
            Drexel University<br>
            3141 Chestnut Street<br></div></div>
            Philadelphia, PA 19104<br>
            <span class=""><br>
<br>
        -- <br>
        Joshua Wall<br>
        Doctoral Candidate<br>
        Department of Physics<br>
        Drexel University<br>
        3141 Chestnut Street<br></span>
        Philadelphia, PA 19104<br>
        <span class=""><br>
<br>
<br>
-- <br>
Joshua Wall<br>
Doctoral Candidate<br>
Department of Physics<br>
Drexel University<br>
3141 Chestnut Street<br>
Philadelphia, PA 19104<br>
</span></blockquote>
<br>
</blockquote></div><br></div></div>