[FLASH-USERS] restarting FLASH

Paul M. Rich richp at flash.uchicago.edu
Mon Aug 25 15:07:47 EDT 2008


Mateusz,

My guess is that a unit that initializes before IO runs in a restart
is overrunning memory.  Grid and Multispecies, particularly if you have
overridden anything there, would be good places to look.  You could
also try using array bounds checking if your compiler supports it.
Among the units that Driver_initFlash starts before IO comes in on a
restart are: Grid, Particles, MaterialProperties and Multispecies.
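
As a rough sketch of how to turn bounds checking on: the forrtl message
in your error output suggests the Intel Fortran compiler, so something
like the following should work in your site Makefile.h. The file
location and the FFLAGS_DEBUG variable name are assumptions based on a
stock FLASH3 site file, so adjust them to whatever your Makefile.h
actually defines, and keep any flags you already have there.

```makefile
# Sketch only: append debugging flags to the debug build of FLASH.
# Assumes sites/<your-site>/Makefile.h already defines FFLAGS_DEBUG
# (hypothetical for your setup) and that the compiler is Intel ifort.
#
#   -g             include debug symbols
#   -traceback     report routine names and line numbers in the
#                  forrtl traceback instead of "Unknown"
#   -check bounds  stop at the first out-of-range array access
FFLAGS_DEBUG += -g -traceback -check bounds

# With gfortran the rough equivalents would be:
#   -g -fbacktrace -fbounds-check
```

After changing the flags, rerun setup with the debug build selected (in
FLASH3 that should be the -debug option to setup), rebuild, and restart
from the same checkpoint. The run will be slower, but the abort should
point at the routine and line that first goes out of bounds.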

Also, if the runtime parameter is not supposed to be used on a restart,
I would check that the code it controls is really not being executed on
a restart.

Does this help?

Paul Rich

mateuszr at umich.edu wrote:
>
>   Hi Paul,
>
> thanks for coming back to me on this. Here are additional details:
>
> 1) units: default cgs
>
> 2) IO mode: default serial
> (Since the last e-mail I also ran the same simulation w/ parallel IO, 
> then stopped it and tried to restart. The code behavior was different: 
> it read the checkpoint file and proceeded to the evolution stage, 
> took just one (n=1) step and then hung, though technically it did 
> not crash.)
>
> 3) I do have one problem-specific runtime parameter that is only used 
> at the initialization stage and that is never important for restarts.
>
> Do let me know if you need any more information.
>
>   thanks,
>     Mateusz
>
> P.S. Btw, I did run a simple Sedov restart experiment to check 
> whether the problem is with the cluster itself. I was able to restart 
> the code without any problem in this test case.
>
>
>
>
> Quoting "Paul M. Rich" <richp at flash.uchicago.edu>:
>
>> Mateusz,
>>
>> We could use some more information so that we can help you figure 
>> this out.  What Flash units are you using in your setup? Which ones 
>> have you customized or overridden in your setup, particularly if any 
>> of the initializations were changed?  Are there any unusual runtime 
>> parameters being used?  Which IO mode is this?
>>
>> This information will help us narrow down where to look considerably.
>>
>> Thanks,
>>
>> Paul Rich
>> ------------------------------
>> ASC Flash Center
>> University of Chicago
>> richp at flash.uchicago.edu
>>
>>
>> mateuszr at umich.edu wrote:
>>>
>>>  Hi all,
>>>
>>> I am having trouble restarting the code. More specifically, I am 
>>> getting a segmentation fault when I try to restart from a valid 
>>> checkpoint file. The details are enclosed below. I would be very 
>>> grateful for some clues.
>>>
>>>  thanks,
>>>    Mateusz
>>>
>>>
>>> [mateuszr at galaxy ~/FLASH]$ more out.txt
>>>
>>> [io_readData] Opening test_hdf5_chk_0011 for restart
>>> rank 1 in job 2  galaxy001.astro.lsa.umich.edu_38345   caused collective abort of all ranks
>>>  exit status of rank 1: killed by signal 9
>>>
>>> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>> Image              PC                Routine            Line        Source
>>> flash3             000000000050CC48  Unknown               Unknown  Unknown
>>> flash3             000000000043EEBA  Unknown               Unknown  Unknown
>>> flash3             000000000043C359  Unknown               Unknown  Unknown
>>> flash3             0000000000410513  Unknown               Unknown  Unknown
>>> flash3             0000000000415C19  Unknown               Unknown  Unknown
>>> flash3             0000000000406982  Unknown               Unknown  Unknown
>>> libc.so.6          00000035AFE1D8A4  Unknown               Unknown  Unknown
>>> flash3             00000000004068A9  Unknown               Unknown  Unknown
>>> touch: cannot touch `/scratch///.keep': Permission denied
>>>