[FLASH-USERS] restarting FLASH

mateuszr at umich.edu mateuszr at umich.edu
Mon Aug 25 14:38:48 EDT 2008

   Hi Paul,

Thanks for getting back to me on this. Here are additional details:

1) units: default cgs

2) IO mode: default serial
(Since my last e-mail I also ran the same simulation with parallel IO,
then stopped it and tried to restart. The code behaved differently:
it read the checkpoint file and proceeded to the evolution stage,
took just one step (n=1), and then hung, though technically it did
not crash.)

3) I do have one problem-specific runtime parameter, but it is used
only at the initialization stage and is not needed for restarts.

Do let me know if you need any more information.


P.S. I also ran a simple Sedov restart test to check whether the
problem lies with the cluster itself. I was able to restart the code
without any problem in this test case.

Quoting "Paul M. Rich" <richp at flash.uchicago.edu>:

> Mateusz,
> We could use some more information so that we can help you figure  
> this out.  What Flash units are you using in your setup? Which ones  
> have you customized or overridden in your setup, particularly if any  
> of the initializations were changed?  Are there any unusual runtime  
> parameters being used?  Which IO mode is this?
> This information will help us narrow down where to look considerably.
> Thanks,
> Paul Rich
> ------------------------------
> ASC Flash Center
> University of Chicago
> richp at flash.uchicago.edu
> mateuszr at umich.edu wrote:
>>  Hi all,
>> I am having trouble restarting the code. More specifically, I am  
>> getting a segmentation fault when I try to restart from a valid  
>> checkpoint file. The details are enclosed below. I would be very  
>> grateful for some clues.
>>  thanks,
>>    Mateusz
>> [mateuszr at galaxy ~/FLASH]$ more out.txt
>> [io_readData] Opening test_hdf5_chk_0011 for restart
>> rank 1 in job 2  galaxy001.astro.lsa.umich.edu_38345   caused  
>> collective abort of all ranks
>>  exit status of rank 1: killed by signal 9
>> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine            Line        Source
>> flash3             000000000050CC48  Unknown               Unknown  Unknown
>> flash3             000000000043EEBA  Unknown               Unknown  Unknown
>> flash3             000000000043C359  Unknown               Unknown  Unknown
>> flash3             0000000000410513  Unknown               Unknown  Unknown
>> flash3             0000000000415C19  Unknown               Unknown  Unknown
>> flash3             0000000000406982  Unknown               Unknown  Unknown
>> libc.so.6          00000035AFE1D8A4  Unknown               Unknown  Unknown
>> flash3             00000000004068A9  Unknown               Unknown  Unknown
>> touch: cannot touch `/scratch///.keep': Permission denied
