[FLASH-USERS] disable check pointing

Brock Palen brockp at umich.edu
Tue Jan 29 10:25:27 EST 2008


It turns out that to stop checkpoint files from being written there is
a parameter, wall_clock_checkpoint, which defaults to 12 hours in both
flash2.5 and flash3.  I set this to -1 in my flash.par, rebuilt, and
all appears well.
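
For reference, a minimal sketch of the flash.par change described above. Only the parameter name and the value -1 come from this message; the comment lines and layout are illustrative, and the rest of the file is assumed to be unchanged.

```
# flash.par (fragment) -- illustrative sketch, not a complete parameter file
# A negative value is what was used here to stop the wall-clock-triggered
# checkpoint (the default is reported above as 12 hours).
wall_clock_checkpoint = -1
```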

I am now up to 29 plots without a single checkpoint, when before the
run would always get killed at around 10 plots.
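
One way to check counts like these is to tally the file_wr_close entries in the FLASH log by type. The sketch below is an assumption-laden helper, not part of FLASH: the function name count_written_files is hypothetical, and it assumes each log entry sits on a single line in the format quoted further down in this thread (the line wrapping there comes from the mail client).

```python
import re
from collections import Counter

# Matches log entries of the form quoted in this thread, e.g.
# [ 01-17-2008  23:52.30 ] file_wr_close: type=checkpoint name=... blocks=5449
_CLOSE_RE = re.compile(r"file_wr_close:\s+type=(\w+)")


def count_written_files(lines):
    """Count completed file writes per type (e.g. 'checkpoint', 'plotfile').

    `lines` is any iterable of log lines; lines that do not contain a
    file_wr_close entry are ignored.
    """
    counts = Counter()
    for line in lines:
        match = _CLOSE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it on a real run, pass an open log file, for example count_written_files(open(path_to_log)), and compare the "checkpoint" and "plotfile" entries of the result.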

Of course I would like checkpointing, so I would like to find the
segfault and fix it, whether it is in ROMIO, hdf5, OpenMPI, or flash.
We were using hdf5-serial; I have not tried parallel.

Note to anyone reading: be sure to _always_ compile with optimizations
for your compiler. (I use -fast under pgi-7.0; for -fastsse I have not
verified correctness.)

A flash run compiled with -g and no overriding optimizations is ~120
steps behind one compiled with -fast on the same
hardware/problem/libraries/etc. for the same wallclock.

Thanks guys, I think we are all set to do research now. If there is
anything you want me to do to track down the segfault, let me know.

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


On Jan 18, 2008, at 1:17 PM, John ZuHone wrote:

> Brock,
>
> 	I think that if you want no checkpoints at all you should set  
> trstrt < 0, and make sure that nrstrt < 0 (integer) and zrstrt < 0  
> as well.
>
> Best,
>
> John ZuHone
>
> On Jan 18, 2008, at 8:56 AM, Brock Palen wrote:
>
>> For parallel runs even if I set
>>
>> trstrt=0
>>
>> two checkpoint files were written.
>> [ 01-17-2008  23:51.27 ] message: [CHECKPOINT_WR] NOTE: will  
>> send           10     blocks per message.
>> [ 01-17-2008  23:51.29 ] file_wr_open: type=checkpoint  
>> name=Ni_Tem_hdf5_chk_0000
>> [ 01-17-2008  23:52.30 ] file_wr_close: type=checkpoint  
>> name=Ni_Tem_hdf5_chk_0000 blocks=5449
>> [ 01-17-2008  23:52.32 ] file_wr_open: type=plotfile  
>> name=Ni_Tem_hdf5_plt_cnt_0000
>> [ 01-17-2008  23:52.36 ] file_wr_close: type=plotfile  
>> name=Ni_Tem_hdf5_plt_cnt_0000
>> [ 01-17-2008  23:52.36 ] [FLASH]: Enter evolution loop...
>> [ 01-17-2008  23:52.36 ] step: n=1 t=0.000000E+00 dt=1.000000E-16
>> [ 01-17-2008  23:59.26 ] file_wr_open: type=checkpoint  
>> name=Ni_Tem_hdf5_chk_0001
>> [ 01-18-2008  00:00.13 ] file_wr_close: type=checkpoint  
>> name=Ni_Tem_hdf5_chk_0001 blocks=5449
>>
>> No other checkpoint files have been written so far,  though  
>> several plot files have been written.  Is this expected behavior?
>>
>> Flash2.5 + pgi +openmpi-1.2.3
>> hdf5/serial
>>
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>
>
>



