[FLASH-USERS] restarting FLASH

mateuszr at umich.edu
Mon Aug 25 14:38:48 EDT 2008


   Hi Paul,

Thanks for getting back to me on this. Here are additional details:

1) units: default cgs

2) IO mode: default serial
(Since my last e-mail I also ran the same simulation with parallel IO,
stopped it, and tried to restart. The behavior was different: the code
read the checkpoint file and proceeded to the evolution stage, took
just one step (n=1), and then hung, though technically it did not
crash.)

3) I do have one problem-specific runtime parameter, but it is used
only at the initialization stage and plays no role in restarts.

Do let me know if you need any more information.

   thanks,
     Mateusz

P.S. I also ran a simple Sedov restart test to check whether the
problem lies with the cluster itself. In that test case I was able to
restart the code without any problem.




Quoting "Paul M. Rich" <richp at flash.uchicago.edu>:

> Mateusz,
>
> We could use some more information so that we can help you figure  
> this out.  What Flash units are you using in your setup? Which ones  
> have you customized or overridden in your setup, particularly if any  
> of the initializations were changed?  Are there any unusual runtime  
> parameters being used?  Which IO mode is this?
>
> This information will help us narrow down where to look considerably.
>
> Thanks,
>
> Paul Rich
> ------------------------------
> ASC Flash Center
> University of Chicago
> richp at flash.uchicago.edu
>
>
> mateuszr at umich.edu wrote:
>>
>>  Hi all,
>>
>> I am having trouble restarting the code. More specifically, I am  
>> getting a segmentation fault when I try to restart from a valid  
>> checkpoint file. The details are enclosed below. I would be very
>> grateful for any clues.
>>
>>  thanks,
>>    Mateusz
>>
>>
>> [mateuszr at galaxy ~/FLASH]$ more out.txt
>>
>> [io_readData] Opening test_hdf5_chk_0011 for restart
>> rank 1 in job 2  galaxy001.astro.lsa.umich.edu_38345   caused  
>> collective abort of all ranks
>>  exit status of rank 1: killed by signal 9
>>
>> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine            Line        Source
>> flash3             000000000050CC48  Unknown               Unknown  Unknown
>> flash3             000000000043EEBA  Unknown               Unknown  Unknown
>> flash3             000000000043C359  Unknown               Unknown  Unknown
>> flash3             0000000000410513  Unknown               Unknown  Unknown
>> flash3             0000000000415C19  Unknown               Unknown  Unknown
>> flash3             0000000000406982  Unknown               Unknown  Unknown
>> libc.so.6          00000035AFE1D8A4  Unknown               Unknown  Unknown
>> flash3             00000000004068A9  Unknown               Unknown  Unknown
>> touch: cannot touch `/scratch///.keep': Permission denied
>>
>
>
>
>
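
The traceback quoted above shows only raw program-counter values, with
every routine listed as "Unknown". As a minimal sketch (not part of
FLASH, and assuming the log has the same column layout as flash_run.err
above), the addresses belonging to the flash3 image could be pulled out
so they can be handed to a symbolizer such as binutils' addr2line. The
names FRAME_RE, extract_frames and SAMPLE are hypothetical helpers made
up for this illustration.

```python
import re

# One frame line of the forrtl traceback: image name, a 16-hex-digit
# program counter, then the routine / line / source columns.
# The leading \W* tolerates mail quote prefixes such as ">> ".
FRAME_RE = re.compile(r"^\W*(\S+)\s+([0-9A-Fa-f]{16})\s+\S+\s+\S+\s+\S+\s*$")


def extract_frames(log_text, image="flash3"):
    """Return the program counters (as hex strings) of frames in `image`."""
    addresses = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match and match.group(1) == image:
            # int() drops the zero padding; hex() gives e.g. '0x50cc48'.
            addresses.append(hex(int(match.group(2), 16)))
    return addresses


# A shortened copy of the traceback from the message above.
SAMPLE = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
flash3             000000000050CC48  Unknown               Unknown  Unknown
flash3             000000000043EEBA  Unknown               Unknown  Unknown
libc.so.6          00000035AFE1D8A4  Unknown               Unknown  Unknown
flash3             00000000004068A9  Unknown               Unknown  Unknown
"""

if __name__ == "__main__":
    addrs = extract_frames(SAMPLE)
    # Print a command line that could be run next to the executable.
    print("addr2line -f -e flash3 " + " ".join(addrs))
```

Whether addr2line resolves anything depends on whether the flash3
binary still carries symbols. Rebuilding with debug information and the
Intel compiler's -traceback option is likely the more reliable way to
get routine names and line numbers into the traceback itself.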