[FLASH-USERS] FLASH crashing: "iteration, no. not moved"

Dominik Derigs derigs at ph1.uni-koeln.de
Tue Mar 7 08:45:15 EST 2017


Dear FLASH users,

I have been seeing a problem on our local cluster for quite some time that I
can't seem to get rid of. Whenever I run a sufficiently large simulation, it
fails sooner or later, with PARAMESH writing this to the output:

   45251 8.5418E-01 5.7059E-06  ( 7.080E-02,  8.350E-02,  0.000E+00) |  5.706E-06
  iteration, no. not moved =            0        4629
  iteration, no. not moved =            1           3
  iteration, no. not moved =            2           1
  iteration, no. not moved =            3           1
  iteration, no. not moved =            4           1
[...]
  iteration, no. not moved =           98           1
  iteration, no. not moved =           99           1
  iteration, no. not moved =          100           1
  ERROR: could not move all blocks in amr_redist_blk
  Try increasing maxblocks or use more processors
  nm2_old, nm2 =            1           1
  ABORTING !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
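
To spot this pattern in a long log file, the "not moved" lines above can be
scanned with a small script. This is only a rough sketch: the function names
and the stall criterion (the same nonzero count repeated over the last few
iterations) are assumptions made for illustration, not part of FLASH or
PARAMESH.

```python
import re

# Matches redistribution lines such as
#   "  iteration, no. not moved =            3           1"
PATTERN = re.compile(r"iteration, no\. not moved =\s+(\d+)\s+(\d+)")

def not_moved_counts(log_text):
    """Return (iteration, blocks_not_moved) pairs found in the log text."""
    return [(int(i), int(n)) for i, n in PATTERN.findall(log_text)]

def is_stalled(counts, window=5):
    """True if the last `window` iterations report the same nonzero count."""
    if len(counts) < window:
        return False
    tail = [n for _, n in counts[-window:]]
    return tail[0] != 0 and len(set(tail)) == 1

sample = """\
  iteration, no. not moved =            0        4629
  iteration, no. not moved =            1           3
  iteration, no. not moved =            2           1
  iteration, no. not moved =            3           1
  iteration, no. not moved =            4           1
"""
counts = not_moved_counts(sample)
print(counts[-1], is_stalled(counts, window=3))
```

For the excerpt above this reports a stall, since a single block remains
unmoved from iteration 2 onward.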

My current simulation (2 dimensions) runs on 64 cores with maxblocks set to
2'000, allowing a total maximum number of blocks of 128'000. With the
current simulation I have 26'470 blocks in total (19'853 leaf blocks), so
I'm well below that limit.
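
The arithmetic behind that statement, as a quick check. The variable names
are chosen here for illustration, and the per-process figure is only an
average: the actual distribution of blocks across processes during
redistribution is not something I have measured.

```python
# Figures quoted in this message.
cores = 64
maxblocks = 2000        # per-process block limit
total_blocks = 26470
leaf_blocks = 19853

capacity = cores * maxblocks              # global upper bound on blocks
avg_per_process = total_blocks / cores    # average load, not the peak
utilization = total_blocks / capacity     # fraction of the global bound in use

print(capacity, round(avg_per_process, 1), f"{utilization:.1%}")
```

So on average each process holds roughly 414 of its 2000 allowed blocks,
about a fifth of the global capacity.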

Accordingly, increasing maxblocks doesn't make any difference, nor does
allowing "nit" to go to a higher value than 100. Running on even more cores
is no solution either (initially I was running on only 16 cores, then 32,
and now 64).

It may or may not be relevant that I'm using the Intel compilers (v17.0)
along with Intel MPI (v5.0.3).

It happens with both FLASH 4.3 and 4.4 (I have not tried other versions of
FLASH). I cannot reproduce this issue reliably. Restarting from one of the
recent checkpoints lets it run through the same step perfectly fine:

   45251 8.5418E-01 5.7059E-06  ( 7.080E-02,  8.350E-02,  0.000E+00) |  5.706E-06
  iteration, no. not moved =            0        4629
  iteration, no. not moved =            1           2
  iteration, no. not moved =            2           0
 refined: total leaf blocks =        19853
 refined: total blocks =        26470
   45252 8.5419E-01 5.7065E-06  ( 7.080E-02,  8.350E-02,  0.000E+00) |  5.707E-06

Do you know how to prevent this error from happening? If not, would it be
safe to remove the corresponding MPI_ABORT entirely and simply carry on with
one block not having been moved correctly?

Best regards,
Dominik

-- 
Dominik Derigs
I. Physikalisches Institut
Universität zu Köln
Zülpicher Straße 77
50937 Köln
GERMANY

https://hera.ph1.uni-koeln.de/~derigs/

Tel. (+49|0) 221 470-8352
Fax. (+49|0) 221 470-5162
