[FLASH-USERS] Restart

Mousumi Das mdas at umich.edu
Fri Dec 28 11:21:17 EST 2007


Hello All,
    Even if I increase the frequency of checkpointing, that is not 
going to help, because the code always stops running while writing the 2nd 
checkpoint file, chk_0002. The log file shows the following messages.

  [ 12-28-2007  01:45.36 ] step: n=163 t=1.969343E-09 dt=8.117542E-12
  [ 12-28-2007  01:47.06 ] step: n=164 t=1.985578E-09 dt=8.124520E-12
  [ 12-28-2007  01:48.50 ] file_wr_open: type=checkpoint name=Ni_Tem_hdf5_chk_0002

The output file gives the following error message:
  [CHECKPOINT_WR] Writing checkpoint file Ni_Tem_hdf5_chk_0002
  Progress:  |.........
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2abd18f1e8
*** End of error message ***

I have checked that there is enough space for writing files. Has any of you 
faced such a problem before?
thanks,
Mousumi
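
For anyone seeing the same symptom, here is a minimal sketch (not part of FLASH) of how a log like the one above could be scanned to find the last completed step and any file write that was opened but never followed by another step. The function name summarize_log and the regular expressions are assumptions based only on the log lines quoted in this message; other FLASH versions may format their logs differently, and treating "no later step" as "write did not finish" is itself an assumption.

```python
import re

# Patterns modelled on the log lines quoted in this message (assumption:
# other FLASH versions or setups may format these lines differently).
STEP_RE = re.compile(r"\]\s+step:\s+n=(\d+)\s+t=(\S+)\s+dt=(\S+)")
OPEN_RE = re.compile(r"\]\s+file_wr_open:\s+type=(\w+)\s+name=(\S+)")


def summarize_log(lines):
    """Return (last_step, pending_write) for an iterable of log lines.

    last_step is a dict with n, t, dt for the last 'step:' line seen.
    pending_write is a dict with type and name for a 'file_wr_open:' line
    that was not followed by any later step, or None.
    """
    last_step = None
    pending_write = None
    for line in lines:
        m = STEP_RE.search(line)
        if m:
            last_step = {
                "n": int(m.group(1)),
                "t": float(m.group(2)),
                "dt": float(m.group(3)),
            }
            # Assumption: a later step implies the earlier write completed.
            pending_write = None
            continue
        m = OPEN_RE.search(line)
        if m:
            pending_write = {"type": m.group(1), "name": m.group(2)}
    return last_step, pending_write


# The three log lines quoted above.
sample = [
    "  [ 12-28-2007  01:45.36 ] step: n=163 t=1.969343E-09 dt=8.117542E-12",
    "  [ 12-28-2007  01:47.06 ] step: n=164 t=1.985578E-09 dt=8.124520E-12",
    "  [ 12-28-2007  01:48.50 ] file_wr_open: type=checkpoint name=Ni_Tem_hdf5_chk_0002",
]
last_step, pending_write = summarize_log(sample)
print("last step:", last_step)
print("write opened with no later step:", pending_write)
```

To run it on a real log, pass an open file object instead of the sample list, for example summarize_log(open(path)) where path is the location of the log file.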


On Mon, 17 Dec 2007, Anshu Dubey wrote:

> The checkpoint and plotfile numbers are not related to each other.
> For the plotfile, the number only provides the number from where
> the plotfile count starts upon restart.
>
> But if your second checkpoint didn't get written correctly, I am afraid
> you will have to restart from the first checkpoint, and repeat all the
> steps in between. You might want to increase the frequency of
> checkpointing if you are running into IO problems; that way you won't
> have to roll back as much.
>
> Anshu
>>
>>    I am using FLASH2.5. My program stops while writing a checkpoint file
>> after outputting several plot files, but it has not completed the run up to
>> the tstop value.
>>
>>   [ 12-14-2007  21:45.22 ] step: n=364 t=5.630710E-09 dt=1.014811E-11
>>   [ 12-14-2007  21:48.17 ] [AMR_REFINE_DEREFINE]: refinement initiated at 21:48.17
>>   [ 12-14-2007  21:49.17 ] [AMR_REFINE_DEREFINE] blocks   all:  min=1326 max=1336  tot=13321
>>   [ 12-14-2007  21:49.17 ] [AMR_REFINE_DEREFINE] blocks valid:  min=1163 max=1167  tot=11656
>>   [ 12-14-2007  21:49.17 ] [AMR_REFINE_DEREFINE]: refinement complete
>>   [ 12-14-2007  21:49.32 ] file_wr_open: type=checkpoint name=Ni_Tem_hdf5_chk_0002
>>
>>
>> I want to restart the simulation from the previous checkpoint file,
>> Ni_Tem_hdf5_chk_0001. Following the FLASH manual, I changed the restart
>> logical variable to .true.
>> For cpunumber I specified 0001 (last written checkpoint file).
>> For pltnumber I specified 0001 (last written plotfile number after the 0001 checkpoint).
>>
>> The last written plotfile number is actually 0056, but there is a gap between
>> checkpoint files chk_0001 and chk_0002.
>>
>> But with this I am not able to restart the simulation. How do I restart
>> the simulation run if I want to get the plotfile (0057) after the last
>> plotfile generated (0056)?
>> thanks,
>> Mousumi
>>
>> On Mon, 17 Dec 2007, Anshu Dubey wrote:
>>
>>> People have used the parallel hdf5 module in Flash2.5 at the center
>>> and still do, but on many platforms the restart is extremely slow.
>>> We haven't been able to determine the cause. However, fortunately,
>>> the problem didn't get carried over to Flash3, so we strongly
>>> recommend switching over to Flash3 if it is possible for you.
>>>
>>> Anshu
>>>
>>>> Hi all,
>>>>
>>>>   I have a serious problem when using the parallel hdf5 module to
>>>> restart from a checkpoint file. It takes about half an hour to read a
>>>> single variable to each node for a 1GB file. The same problem was
>>>> reported two years ago
>>>> (http://flash.uchicago.edu/pipermail/flash-users/2005-April/001938.html)
>>>> but I cannot find the follow-ups. Has anyone succeeded in using the
>>>> parallel hdf5 module so far? Will the Flash center continue to support
>>>> such issues on Flash2.5? Thanks!
>>>>
>>>> Best,
>>>> Shikui
>>>>


