<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Hi Ryan,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I managed to get it working consistently by reducing both the memory usage and the number of nodes (went from 30 nodes to 20).
<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Best,<o:p></o:p></p>
<p class="MsoNormal">Tim<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Ryan Farber &lt;rjfarber@umich.edu&gt;<br>
<b>Sent:</b> Wednesday, September 13, 2023 11:01 AM<br>
<b>To:</b> Timothy Mark Johnson &lt;tmarkj@mit.edu&gt;<br>
<b>Cc:</b> flash-users@flash.rochester.edu<br>
<b>Subject:</b> Re: [FLASH-USERS] Issue restarting FLASH - hangs after [GRID amr_refine_derefine]: refinement complete<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">Hi Tim,<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Excuse the very brief response, but historically I've found that allowing for more memory overhead helps with this issue.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Best,<br clear="all">
<o:p></o:p></p>
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">--------<o:p></o:p></p>
<div>
<p class="MsoNormal">Ryan<o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Tue, Sep 12, 2023 at 11:08 AM Timothy Mark Johnson &lt;<a href="mailto:tmarkj@mit.edu">tmarkj@mit.edu</a>&gt; wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Hi FLASH users,<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’ve been trying to run some relatively large 3D simulations, but I’m consistently running into issues restarting from the checkpoint files. The checkpoint files are about 150 GB
and seem to be read in just fine. The code loads the checkpoint, then freezes after the AMR refinement is complete and stays there indefinitely. I’m running on 30 nodes, each with 32 cores. All the files live on a Lustre filesystem.
<br>
<br>
I’ve managed to restart it sometimes by moving the checkpoint file to different locations, but this has been pretty hit or miss. The supercomputer also might be giving me different nodes between tries, so it might be an issue with specific nodes. Maybe the nodes
it gives me are too far apart? I’m not sure if that’s realistic, though…<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Has anyone else had issues restarting large simulations? I wonder how much of this might be a result of issues with the supercomputer. I’ve attached my terminal output and the .log
file. Please let me know if additional information would be helpful.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Best,<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Tim Johnson<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal">_______________________________________________<br>
flash-users mailing list<br>
<a href="mailto:flash-users@flash.rochester.edu" target="_blank">flash-users@flash.rochester.edu</a><br>
<br>
For list info, including unsubscribe:<br>
<a href="https://flash.rochester.edu/mailman/listinfo/flash-users" target="_blank">https://flash.rochester.edu/mailman/listinfo/flash-users</a><o:p></o:p></p>
</div>
</blockquote>
</div>
</div>
</body>
</html>