[FLASH-USERS] Issue restarting FLASH - hangs after [GRID amr_refine_derefine]: refinement complete

Timothy Mark Johnson tmarkj at mit.edu
Tue Sep 12 11:06:34 EDT 2023


Hi FLASH users,

I've been trying to run some relatively large 3D simulations, but I'm consistently running into issues restarting from the checkpoint files. The checkpoint files are about 150 GB and seem to be read in just fine. The code loads the checkpoint, then freezes after the AMR refinement is complete; the last message is "[GRID amr_refine_derefine]: refinement complete", and it stays there indefinitely. I'm running on 30 nodes, each with 32 cores. All the files live on a Lustre filesystem.

I've sometimes managed to restart it by moving the checkpoint file to a different location, but this has been pretty hit or miss. The supercomputer might also be giving me different nodes between tries, so it could be an issue with specific nodes. Maybe the nodes it gives me are too far apart? I'm not sure if that's realistic though...

Has anyone else had issues restarting large simulations? I wonder how much of this might be caused by the supercomputer itself. I've attached my terminal output and the .log file. Please let me know if additional information would be helpful.

Best,
Tim Johnson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gasjetexp.log
Type: application/octet-stream
Size: 85622 bytes
Desc: gasjetexp.log
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20230912/d24a3cf0/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm_output_51653507.log
Type: application/octet-stream
Size: 3498 bytes
Desc: slurm_output_51653507.log
URL: <http://flash.rochester.edu/pipermail/flash-users/attachments/20230912/d24a3cf0/attachment-0001.obj>

