[FLASH-USERS] restarting FLASH
Paul M. Rich
richp at flash.uchicago.edu
Mon Aug 25 15:07:47 EDT 2008
Mateusz,
My guess is that a unit that initializes before IO runs during a
restart is overrunning memory. Grid and Multispecies, particularly
if you have overridden anything there, would be good places to look.
You could also try enabling array bounds checking if your compiler
supports it. Among the units that Driver_initFlash initializes before
IO on a restart are Grid, Particles, MaterialProperties, and
Multispecies. Also, if the runtime parameter is not supposed to be
used on a restart, I would check that the code it controls is not
being executed on a restart.
Does this help?
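
As a possible aid for narrowing this down: the traceback quoted further
down in this thread lists every frame as "Unknown", so it does not say
which routine faulted. The sketch below is only an illustration, not
part of FLASH. It pulls the PC addresses for one image out of such a
traceback and builds an addr2line command line for them. The helper
names (frame_addresses, addr2line_command) are hypothetical, and it
assumes the executable was built with debug symbols (for example -g)
and that binutils addr2line is installed.

```python
import re

# One traceback frame: optional quote markers, image name, hex PC.
# (Assumed layout, based on the forrtl output quoted in this thread.)
FRAME = re.compile(r"^\W*(\S+)\s+([0-9A-Fa-f]{8,16})\b")


def frame_addresses(traceback_text, image="flash3"):
    """Return the PC addresses ('0x...' strings) of frames in `image`."""
    addrs = []
    for line in traceback_text.splitlines():
        m = FRAME.match(line)
        if m and m.group(1) == image:
            # Normalize the zero-padded hex field to a plain hex literal.
            addrs.append(hex(int(m.group(2), 16)))
    return addrs


def addr2line_command(traceback_text, image="flash3"):
    """Build an argument list asking addr2line for function and file:line."""
    return ["addr2line", "-f", "-e", image] + frame_addresses(traceback_text, image)


if __name__ == "__main__":
    import sys

    # Usage: python frames.py < flash_run.err
    print(" ".join(addr2line_command(sys.stdin.read())))
```

Running the printed command in the directory that holds the flash3
executable should report a function name and source line for each
address, provided the debug information is present. Frames from other
images, such as libc.so.6, are skipped on purpose.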
Paul Rich
mateuszr at umich.edu wrote:
>
> Hi Paul,
>
> thanks for getting back to me on this. Here are additional details:
>
> 1) units: default cgs
>
> 2) IO mode: default serial
> (Since the last e-mail I also ran the same simulation with parallel
> IO, then stopped it and tried to restart. The code behavior was
> different: it did read the checkpoint file and proceeded to the
> evolution stage, took just one (n=1) step, and then hung, though
> technically it did not crash.)
>
> 3) I do have one problem-specific runtime parameter that is only used
> at the initialization stage and that is never important for restarts.
>
> Do let me know if you need any more information.
>
> thanks,
> Mateusz
>
> P.S. Btw, I did run a simple Sedov restart experiment to check
> whether the problem is with the cluster itself. I was able to restart
> the code without any problem in this test case.
>
>
>
>
> Quoting "Paul M. Rich" <richp at flash.uchicago.edu>:
>
>> Mateusz,
>>
>> We could use some more information so that we can help you figure
>> this out. What Flash units are you using in your setup? Which ones
>> have you customized or overridden in your setup, particularly if any
>> of the initializations were changed? Are there any unusual runtime
>> parameters being used? Which IO mode is this?
>>
>> This information will help us narrow down where to look considerably.
>>
>> Thanks,
>>
>> Paul Rich
>> ------------------------------
>> ASC Flash Center
>> University of Chicago
>> richp at flash.uchicago.edu
>>
>>
>> mateuszr at umich.edu wrote:
>>>
>>> Hi all,
>>>
>>> I am having trouble restarting the code. More specifically, I am
>>> getting a segmentation fault when I try to restart from a valid
>>> checkpoint file. The details are enclosed below. I would be very
>>> grateful for some clues.
>>>
>>> thanks,
>>> Mateusz
>>>
>>>
>>> [mateuszr at galaxy ~/FLASH]$ more out.txt
>>>
>>> [io_readData] Opening test_hdf5_chk_0011 for restart
>>> rank 1 in job 2 galaxy001.astro.lsa.umich.edu_38345 caused
>>> collective abort of all ranks
>>> exit status of rank 1: killed by signal 9
>>>
>>> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>> Image      PC                Routine  Line     Source
>>> flash3     000000000050CC48  Unknown  Unknown  Unknown
>>> flash3     000000000043EEBA  Unknown  Unknown  Unknown
>>> flash3     000000000043C359  Unknown  Unknown  Unknown
>>> flash3     0000000000410513  Unknown  Unknown  Unknown
>>> flash3     0000000000415C19  Unknown  Unknown  Unknown
>>> flash3     0000000000406982  Unknown  Unknown  Unknown
>>> libc.so.6  00000035AFE1D8A4  Unknown  Unknown  Unknown
>>> flash3     00000000004068A9  Unknown  Unknown  Unknown
>>> touch: cannot touch `/scratch///.keep': Permission denied
>>>
>>
>>
>>
>>