[FLASH-USERS] disable check pointing
Brock Palen
brockp at umich.edu
Tue Jan 29 10:25:27 EST 2008
It turns out that checkpoint files are also controlled by a
parameter, wall_clock_checkpoint, which defaults to 12 hours in both
flash2.5 and flash3. I set it to -1 in my flash.par, rebuilt, and
all appears well.
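
For reference, a minimal flash.par sketch of the settings discussed in this thread. It combines the wall_clock_checkpoint change above with the trstrt/nrstrt/zrstrt suggestion quoted below. The parameter names are the flash2.5 ones used in the thread. The specific negative values are illustrative assumptions, and only the wall_clock_checkpoint setting is reported here as actually tried:

```text
# flash.par fragment: suppress checkpoint output (sketch, not verified)

# Wall-clock-triggered checkpoints; the default is 12 hours.
# Set to -1 as described in this message.
wall_clock_checkpoint = -1

# Simulation-time, step-count, and redshift checkpoint intervals,
# set negative per the suggestion quoted below (values are assumptions).
trstrt = -1.0
nrstrt = -1
zrstrt = -1.0
```

With checkpoints disabled like this a killed run cannot be restarted, so it is probably only sensible as a temporary workaround.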
I am now up to 29 plots without a single checkpoint, whereas before
the run would always get killed at around 10 plots.
Of course I would like checkpointing, so I would like to find the
segfault and fix it, whether it is in ROMIO, hdf5, OpenMPI, or flash.
We were using hdf5-serial. I have not tried parallel.
Note to anyone reading: be sure to _always_ compile with optimizations
for your compiler. (I use -fast under pgi-7.0; I have not verified
correctness with -fastsse.)
A flash run compiled with -g and no overriding optimizations is ~120
steps behind one compiled with -fast on the same
hardware/problem/libraries/etc. for the same wallclock.
Thanks guys, I think we are all set to do research now. If there is
anything you want me to do to track down the segfault, let me know.
Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Jan 18, 2008, at 1:17 PM, John ZuHone wrote:
> Brock,
>
> I think that if you want no checkpoints at all you should set
> trstrt < 0, and make sure that nrstrt < 0 (integer) and zrstrt < 0
> as well.
>
> Best,
>
> John ZuHone
>
> On Jan 18, 2008, at 8:56 AM, Brock Palen wrote:
>
>> For parallel runs even if I set
>>
>> trstrt=0
>>
>> two checkpoint files were written.
>> [ 01-17-2008 23:51.27 ] message: [CHECKPOINT_WR] NOTE: will
>> send 10 blocks per message.
>> [ 01-17-2008 23:51.29 ] file_wr_open: type=checkpoint
>> name=Ni_Tem_hdf5_chk_0000
>> [ 01-17-2008 23:52.30 ] file_wr_close: type=checkpoint
>> name=Ni_Tem_hdf5_chk_0000 blocks=5449
>> [ 01-17-2008 23:52.32 ] file_wr_open: type=plotfile
>> name=Ni_Tem_hdf5_plt_cnt_0000
>> [ 01-17-2008 23:52.36 ] file_wr_close: type=plotfile
>> name=Ni_Tem_hdf5_plt_cnt_0000
>> [ 01-17-2008 23:52.36 ] [FLASH]: Enter evolution loop...
>> [ 01-17-2008 23:52.36 ] step: n=1 t=0.000000E+00 dt=1.000000E-16
>> [ 01-17-2008 23:59.26 ] file_wr_open: type=checkpoint
>> name=Ni_Tem_hdf5_chk_0001
>> [ 01-18-2008 00:00.13 ] file_wr_close: type=checkpoint
>> name=Ni_Tem_hdf5_chk_0001 blocks=5449
>>
>> No other checkpoint files have been written so far, though
>> several plot files have been written. Is this expected behavior?
>>
>> Flash2.5 + pgi +openmpi-1.2.3
>> hdf5/serial
>>
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>
>
>