<div dir="ltr">Dear Klaus,<div><br></div><div>our cluster is quite responsive today and I already have the output from two independent runs:</div><div><br></div><div>The new debug lines tell us:</div><div><div><font face="monospace, monospace"> 50995 9.3378E-01 8.2062E-06 ( 1.932E+00, 1.411E-01, 0.000E+00) | 8.206E-06</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 0 11928</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 1 6</font></div><div><font face="monospace, monospace">On proc 29: block 440@ 29 still needs to move to 1@ 30.</font></div><div><font face="monospace, monospace">On proc 15: block 441@ 15 still needs to move to 1@ 16.</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace">On proc 14: block 442@ 14 still needs to move to 1@ 15.</font></div><div><font face="monospace, monospace">On proc 8: block 442@ 8 still needs to move to 1@ 9.</font></div><div><font face="monospace, monospace">On proc 29: block 441@ 29 still needs to move to 2@ 30.</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 2 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 3 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 4 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 5 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 6 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div></div><div><font face="monospace, monospace">[...]</font></div><div><div><font face="monospace, monospace"> iteration, no. not moved = 95 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 96 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 97 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 98 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 99 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> iteration, no. 
not moved = 100 1</font></div><div><font face="monospace, monospace">On proc 21: block 441@ 21 still needs to move to 1@ 22.</font></div><div><font face="monospace, monospace"> ERROR: could not move all blocks in amr_redist_blk</font></div><div><font face="monospace, monospace"> Try increasing maxblocks or use more processors</font></div><div><font face="monospace, monospace"> nm2_old, nm2 = 1 1</font></div><div><font face="monospace, monospace"> ABORTING !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<wbr>!!!!!!!!!!</font></div></div><div><br></div><div>This time, the routines have been asked to decrease the number of blocks (unlike before):</div><div><br></div><div><font face="monospace, monospace"> [ 03-08-2017 12:28:18.714 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 28258<br></font></div><div><font face="monospace, monospace"> [...]</font></div><div><font face="monospace, monospace"> [ 03-08-2017 12:29:09.939 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 28254</font></div><div><font face="monospace, monospace"> (end of log file)</font></div><div><br></div><div>I have queued the same simulation again and received a similar error:</div><div><br></div><div><div><font face="monospace, monospace"> 50899 9.3221E-01 8.1590E-06 ( 1.932E+00, 1.411E-01, 0.000E+00) | 8.159E-06</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 0 2726</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 1 2</font></div><div><font face="monospace, monospace">On proc 52: block 441@ 52 still needs to move to 1@ 53.</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 2 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 3 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 4 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 5 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 6 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div></div><div><font face="monospace, monospace">[...]</font></div><div><div><font face="monospace, monospace"> iteration, no. not moved = 98 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. not moved = 99 1</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace">On proc 49: block 442@ 49 still needs to move to 1@ 50.</font></div><div><font face="monospace, monospace"> iteration, no. 
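
(For the list archive: the "On proc ...: block ... still needs to move to ..." lines above come from the debug patch Klaus suggested in his follow-up mail. I am not reproducing his exact code here; as I understand it, it amounts to a per-rank report of roughly the following shape after each redistribution sweep. The names follow my reading of amr_redist_blk -- new_loc(1,lb) being the destination slot and new_loc(2,lb) the destination processor of local block lb -- so treat this as a sketch, not the actual patch:)

 ! Sketch only: my paraphrase of the suggested debug output, NOT the
 ! exact patch from the follow-up mail.
 subroutine report_unmoved(mype, lnblocks, moved, new_loc)
   implicit none
   integer, intent(in) :: mype                 ! this MPI rank
   integer, intent(in) :: lnblocks             ! number of local blocks
   logical, intent(in) :: moved(lnblocks)      ! .true. once a block's data was shipped
   integer, intent(in) :: new_loc(2,lnblocks)  ! (1,lb) dest slot, (2,lb) dest proc
   integer :: lb

   do lb = 1, lnblocks
      if (.not. moved(lb)) then
         write(*,'(A,I0,A,I0,A,I0,A,I0,A,I0,A)') &
              'On proc ', mype, ': block ', lb, '@ ', mype, &
              ' still needs to move to ', new_loc(1,lb), '@ ', new_loc(2,lb), '.'
      end if
   end do
 end subroutine report_unmoved

Read that way, the dumps say that block 441 on proc 21 (first run) and block 442 on proc 49 (second run) each wait for slot 1 on the neighbouring rank to become free, and it never does.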

According to the log file, the block count should increase this time, from 28210 to 28214.

Best regards,
Dominik

2017-03-08 12:25 GMT+01:00 Dominik Derigs <derigs@ph1.uni-koeln.de>:
> Hi Klaus,
>
> Thank you for your messages.
>
> Some answers to the questions you asked:
>
> > Are you using face variables?
>
> No.
>
> > Does the same happen if you
> >  - use the older Paramesh implementation (setup with +pm40)?
> >  - or, alternatively, set the following runtime parameters to .false.?
> >    use_flash_surr_blks_fill
> >    use_reduced_orrery
>
> I'll have to try this.
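>
> (For my own notes: if I understand the runtime-parameter mechanism
> correctly, that would be these two lines in our flash.par -- please
> correct me if I have it wrong:)
>
>    # suggested test: disable the newer surr_blks fill and the
>    # reduced orrery; both default to .true. as far as I can tell
>    use_flash_surr_blks_fill = .false.
>    use_reduced_orrery       = .false.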
>
> > Is Paramesh trying to increase or to decrease the number of blocks
> > when this happens?
>
> It tried to increase the block count by 4; these are the most recent messages:
>
>  [ 03-07-2017 01:59:00.137 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 26466
>  [...]
>  [ 03-07-2017 01:59:51.245 ] [GRID amr_refine_derefine]: redist. phase. tot blks requested: 26470
>  (end of log file)
>
> > Have you made any unusual changes to Paramesh?
>
> No.
>
> > In particular, is it still true that
> >    maxblocks_tr = 10*maxblocks
> > (as per amr_initialize.F90)?
>
> Yes.
>
> > Meta-information has already been moved and modified at this point,
> > under the assumption that the movement of the contents of all blocks
> > succeeds. It would be inconsistent if that last block then did not
> > actually get moved.
>
> I was afraid that might be the case.
>
> I added the debug code you suggested in your follow-up mail and queued a
> new simulation.
>
> Best regards,
> Dominik
>
> 2017-03-08 4:57 GMT+01:00 Klaus Weide <klaus@flash.uchicago.edu>:
> > On Tue, 7 Mar 2017, Dominik Derigs wrote:
> >
> > > Dear FLASH users,
> > >
> > > I've been seeing a problem for quite some time on our local cluster
> > > which I don't seem to be able to get rid of. Whenever I run a
> > > sufficiently large simulation, it will fail sooner or later while
> > > Paramesh writes this to the output:
> > >
> > > 45251 8.5418E-01 5.7059E-06 ( 7.080E-02, 8.350E-02, 0.000E+00) | 5.706E-06
> > > iteration, no. not moved = 0 4629
> > > iteration, no. not moved = 1 3
> > > iteration, no. not moved = 2 1
> > > iteration, no. not moved = 3 1
> > > iteration, no. not moved = 4 1
> > > [...]
> > > iteration, no. not moved = 98 1
> > > iteration, no. not moved = 99 1
> > > iteration, no. not moved = 100 1
> > > ERROR: could not move all blocks in amr_redist_blk
> > > Try increasing maxblocks or use more processors
> > > nm2_old, nm2 = 1 1
> > > ABORTING !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> >
> > Hi Dominik,
> >
> > I have not encountered this problem myself.
> >
> > Please provide some more info about your setup.
> > Some particular questions:
> >
> > Are you using face variables?
> >
> > Does the same happen if you
> >  - use the older Paramesh implementation (setup with +pm40)?
> >  - or, alternatively, set the following runtime parameters to .false.?
> >
> >    use_flash_surr_blks_fill
> >    use_reduced_orrery
> >
> > Is Paramesh trying to increase or to decrease the number of blocks when
> > this happens? (The log file may show this information, perhaps in a
> > message like this:
> >    ... [GRID amr_refine_derefine]: redist. phase. tot blks requested: 453 )
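> >
> > (For example -- with the name of your actual log file substituted --
> > something like
> >
> >    grep 'tot blks requested' myrun.log | tail -2
> >
> > would show the last two requests.)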
> >
> > Have you made any unusual changes to Paramesh?
> > In particular, is it still true that
> >
> >    maxblocks_tr = 10*maxblocks
> >
> > (as per amr_initialize.F90)?
> >
> > > Do you know how to prevent this error from happening or - if not -
> > > if it is safe to remove the corresponding MPI_ABORT entirely and
> > > just work with one block not being shifted around correctly?
> >
> > I do not think this would be safe. Meta-information has already been
> > moved and modified at this point, under the assumption that the
> > movement of the contents of all blocks succeeds. It would be
> > inconsistent if that last block then did not actually get moved.
> >
> > Klaus

--
Dominik Derigs
I. Physikalisches Institut
Universität zu Köln
Zülpicher Straße 77
50937 Köln
GERMANY

https://hera.ph1.uni-koeln.de/~derigs/

Tel. (+49|0) 221 470-8352
Fax. (+49|0) 221 470-5162

This email and any files transmitted with it may contain confidential and/or privileged material and is intended only for the person or entity to which it is addressed. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you have received this email in error, please notify the sender immediately and delete this material from all known records.