[FLASH-USERS] restarting FLASH
mateuszr at umich.edu
Mon Aug 25 14:38:48 EDT 2008
Hi Paul,
Thanks for getting back to me on this. Here are the additional details:
1) units: default cgs
2) IO mode: default serial
(Since my last e-mail I also ran the same simulation with parallel IO,
then stopped it and tried to restart. The behavior was different: the
code read the checkpoint file and proceeded to the evolution stage,
took just one step (n=1), and then hung, though technically it did
not crash.)
3) I do have one problem-specific runtime parameter that is only used
at the initialization stage and that is never important for restarts.
Do let me know if you need any more information.
thanks,
Mateusz
P.S. By the way, I ran a simple Sedov restart experiment to check
whether the problem lies with the cluster itself. I was able to restart
the code without any problem in that test case.
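One inexpensive way to rule out a truncated or non-HDF5 checkpoint file before a restart is to look for the HDF5 signature at the start of the file. The sketch below is only a minimal sanity check, not a validation of the checkpoint contents. The function name looks_like_hdf5 is hypothetical, and the check assumes the file has no HDF5 user block, so the signature sits at byte offset 0.

```python
# The fixed 8-byte signature that opens an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` begins with the HDF5 signature.

    Assumption: no user block precedes the superblock, so the signature
    is expected at offset 0. Unreadable or missing files return False.
    """
    try:
        with open(path, "rb") as f:
            return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
    except OSError:
        return False
```

For the run quoted below, this would be called as looks_like_hdf5("test_hdf5_chk_0011"). A True result only shows that the file header is intact; it does not show that the datasets inside are complete.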
Quoting "Paul M. Rich" <richp at flash.uchicago.edu>:
> Mateusz,
>
> We could use some more information so that we can help you figure
> this out. What Flash units are you using in your setup? Which ones
> have you customized or overridden in your setup, particularly if any
> of the initializations were changed? Are there any unusual runtime
> parameters being used? Which IO mode is this?
>
> This information will help us narrow down where to look considerably.
>
> Thanks,
>
> Paul Rich
> ------------------------------
> ASC Flash Center
> University of Chicago
> richp at flash.uchicago.edu
>
>
> mateuszr at umich.edu wrote:
>>
>> Hi all,
>>
>> I am having trouble restarting the code. More specifically, I am
>> getting a segmentation fault when I try to restart from a valid
>> checkpoint file. The details are enclosed below. I would be very
>> grateful for some clues.
>>
>> thanks,
>> Mateusz
>>
>>
>> [mateuszr at galaxy ~/FLASH]$ more out.txt
>>
>> [io_readData] Opening test_hdf5_chk_0011 for restart
>> rank 1 in job 2 galaxy001.astro.lsa.umich.edu_38345 caused
>> collective abort of all ranks
>> exit status of rank 1: killed by signal 9
>>
>> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image PC Routine Line Source
>> flash3 000000000050CC48 Unknown Unknown Unknown
>> flash3 000000000043EEBA Unknown Unknown Unknown
>> flash3 000000000043C359 Unknown Unknown Unknown
>> flash3 0000000000410513 Unknown Unknown Unknown
>> flash3 0000000000415C19 Unknown Unknown Unknown
>> flash3 0000000000406982 Unknown Unknown Unknown
>> libc.so.6 00000035AFE1D8A4 Unknown Unknown Unknown
>> flash3 00000000004068A9 Unknown Unknown Unknown
>> touch: cannot touch `/scratch///.keep': Permission denied
>>