[FLASH-USERS] Restart

Anshu Dubey dubey at flash.uchicago.edu
Fri Dec 28 11:27:28 EST 2007


Try switching the IO mode between hdf5_serial and hdf5_parallel and see
if you do any better.

> Hello All,
>     Even if I increase the frequency of checkpointing, that is not
> going to help, because the code always stops running while writing the 2nd
> checkpoint file, chk_0002. The log file shows the following error message:
>
>   [ 12-28-2007  01:45.36 ] step: n=163 t=1.969343E-09 dt=8.117542E-12
>   [ 12-28-2007  01:47.06 ] step: n=164 t=1.985578E-09 dt=8.124520E-12
>   [ 12-28-2007  01:48.50 ] file_wr_open: type=checkpoint name=Ni_Tem_hdf5_chk_0002
>
> The output file gives the following error message:
>   [CHECKPOINT_WR] Writing checkpoint file Ni_Tem_hdf5_chk_0002
>   Progress:  |.........
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x2abd18f1e8
> *** End of error message ***
>
> I have checked that there is enough space for writing files. Has any of
> you faced such a problem before?
> thanks,
> Mousumi
>
>
> On Mon, 17 Dec 2007, Anshu Dubey wrote:
>
>> The checkpoint and plotfile numbers are not related to each other.
>> For the plotfile, the number only provides the number from where
>> the plotfile count starts upon restart.
>>
>> But if your second checkpoint didn't get written correctly, I am afraid
>> you will have to restart from the first checkpoint and repeat all the
>> steps in between. You might want to increase the frequency of
>> checkpointing if you are running into IO problems; that way you won't
>> have to roll back as much.
>>
>> Anshu
>>>
>>>    I am using FLASH2.5. My programme stops while writing a checkpoint
>>> file after outputting several plot files, before the run has reached
>>> the tstop value.
>>>
>>>   [ 12-14-2007  21:45.22 ] step: n=364 t=5.630710E-09 dt=1.014811E-11
>>>   [ 12-14-2007  21:48.17 ] [AMR_REFINE_DEREFINE]: refinement initiated at 21:48.17
>>>   [ 12-14-2007  21:49.17 ] [AMR_REFINE_DEREFINE] blocks   all: min=1326 max=1336  tot=13321
>>>   [ 12-14-2007  21:49.17 ] [AMR_REFINE_DEREFINE] blocks valid: min=1163 max=1167  tot=11656
>>>   [ 12-14-2007  21:49.17 ] [AMR_REFINE_DEREFINE]: refinement complete
>>>   [ 12-14-2007  21:49.32 ] file_wr_open: type=checkpoint name=Ni_Tem_hdf5_chk_0002
>>>
>>>
>>> I want to restart the simulation from the previous checkpoint file,
>>> Ni_Tem_hdf5_chk_0001. Following the FLASH manual, I changed the restart
>>> logical variable to .true. and specified:
>>> cpunumber 0001 (last written checkpoint file)
>>> pltnumber 0001 (last written plotfile number after the 0001 checkpoint)
>>>
>>> However, the last written plotfile number is 0056, and there is a gap
>>> between checkpoint files chk_0001 and chk_0002.
>>>
>>> With these settings I am not able to restart the simulation. How do I
>>> restart the run if I want the next plotfile (0057) to follow the last
>>> plotfile generated (0056)?
>>> thanks,
>>> Mousumi
>>>
>>> On Mon, 17 Dec 2007, Anshu Dubey wrote:
>>>
>>>> People have used the parallel hdf5 module in Flash2.5 at the center
>>>> and still do, but on many platforms the restart is extremely slow.
>>>> We haven't been able to determine the cause. However, fortunately,
>>>> the problem didn't get carried over to Flash3, so we strongly
>>>> recommend switching over to Flash3 if it is possible for you.
>>>>
>>>> Anshu
>>>>
>>>>> Hi all,
>>>>>
>>>>>   I have a serious problem when using the parallel hdf5 module to
>>>>> restart from a checkpoint file. It takes about half an hour to read a
>>>>> single variable to each node for a 1GB file. The same problem was
>>>>> reported two years ago
>>>>> (http://flash.uchicago.edu/pipermail/flash-users/2005-April/001938.html)
>>>>> but I cannot find the follow-ups. Has anyone succeeded in using the
>>>>> parallel hdf5 module so far? Will the Flash center continue to support
>>>>> such issues on Flash2.5? Thanks!
>>>>>
>>>>> Bests,
>>>>> Shikui
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
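
For anyone hitting the same failure, it can help to pull the last completed
step and the last output file the run tried to open out of the log, since
that shows how many steps would have to be repeated after restarting from an
earlier checkpoint. The following is only a minimal sketch in Python and is
not part of FLASH: the name summarize_log is made up for illustration, the
patterns are modeled solely on the log excerpts quoted in this thread, and it
assumes each log entry sits on a single line in the log file.

```python
import re

# Patterns modeled on the log excerpts quoted in this thread (assumed format).
STEP_RE = re.compile(r"step: n=(\d+) t=(\S+) dt=(\S+)")
OPEN_RE = re.compile(r"file_wr_open: type=(\w+)\s+name=(\S+)")


def summarize_log(text):
    """Return the last step seen and the last output file the run tried to open."""
    last_step = None   # (step number, simulation time, time step)
    last_open = None   # (file type, file name)
    for line in text.splitlines():
        m = STEP_RE.search(line)
        if m:
            last_step = (int(m.group(1)), float(m.group(2)), float(m.group(3)))
            continue
        m = OPEN_RE.search(line)
        if m:
            last_open = (m.group(1), m.group(2))
    return last_step, last_open


# Sample taken from the log lines quoted in the message of 28 Dec 2007.
sample = """\
  [ 12-28-2007  01:45.36 ] step: n=163 t=1.969343E-09 dt=8.117542E-12
  [ 12-28-2007  01:47.06 ] step: n=164 t=1.985578E-09 dt=8.124520E-12
  [ 12-28-2007  01:48.50 ] file_wr_open: type=checkpoint name=Ni_Tem_hdf5_chk_0002
"""

last_step, last_open = summarize_log(sample)
print("last step:", last_step)
print("last file opened:", last_open)
```

For the sample above this reports step 164 as the last completed step and
Ni_Tem_hdf5_chk_0002 as the last file opened for writing. It says nothing
about whether that file was written completely.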



