[FLASH-USERS] truncation errors and restarting

Mateusz Ruszkowski mateuszr at umich.edu
Mon Nov 17 13:49:31 EST 2008


  Hi,

I am restarting, from a checkpoint file, a run that crashed with a
segmentation fault. For many steps after the restart the code behaves
*exactly* as it did before the crash. However, at some point the
timestep changes by 10% in one step. The restarted run is now past the
point where the original run crashed and is running normally. I
understand that a problem like this may be very difficult to diagnose,
but I am wondering whether tiny truncation errors in the checkpoint file
could eventually have made the code behave slightly differently (e.g.,
different switches being activated), which in turn prevented the
segmentation fault. Is this possible?
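
To make the question concrete, here is a minimal Python sketch of how a value can change slightly when it is written out at lower precision and read back. It assumes, purely for illustration, that a 64-bit value is stored as 32 bits; whether anything like this happens in the actual checkpoint file is not established here. The helper name roundtrip_float32 is hypothetical, and the number is the dt printed in the log.

```python
import struct

def roundtrip_float32(x):
    """Pack a 64-bit Python float into 32 bits and read it back.

    This mimics a hypothetical checkpoint that stores reduced precision.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]

dt = 2.140401e+02                 # timestep value as printed in the log
dt_restored = roundtrip_float32(dt)

# Relative error introduced by the round trip; for float32 it is bounded
# by half a unit in the last place, i.e. at most 2**-24.
rel_err = abs(dt_restored - dt) / dt

print(f"original : {dt!r}")
print(f"restored : {dt_restored!r}")
print(f"rel. err : {rel_err:.3e}")
```

A difference this small is invisible at first, but any comparison against a threshold that happens to fall between the two values would take a different branch, after which the two runs need not stay identical.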

   Mateusz

P.S. This is what the logs show around the step where the timestep
changes (all previous timesteps were identical):

Old:

  [ 11-17-2008  03:59:53.253 ] [gr_hgSolve]: gr_hgSolve: ite  4: norm(residual)/norm(src) =  1.445309E-06
  [ 11-17-2008  03:59:53.289 ] [mpi_amr_comm_setup]: buffer_dim_send=291565, buffer_dim_recv=218221
  [ 11-17-2008  03:59:53.816 ] [mpi_amr_comm_setup]: buffer_dim_send=232373, buffer_dim_recv=188625
  [ 11-17-2008  03:59:54.186 ] [mpi_amr_comm_setup]: buffer_dim_send=227937, buffer_dim_recv=188625
  [ 11-17-2008  03:59:54.692 ] [mpi_amr_comm_setup]: buffer_dim_send=227353, buffer_dim_recv=188625
  [ 11-17-2008  03:59:56.718 ] [gr_hgSolve]: gr_hgSolve: ite  5: norm(residual)/norm(src) =  1.673032E-07
  [ 11-17-2008  03:59:58.194 ] step: n=1029 t=7.484192E+05 dt=2.140401E+02
  [ 11-17-2008  03:59:58.536 ] [mpi_amr_comm_setup]: buffer_dim_send=5646649, buffer_dim_recv=4801841
  [ 11-17-2008  04:00:07.804 ] [mpi_amr_comm_setup]: buffer_dim_send=1310137, buffer_dim_recv=1107121
  [ 11-17-2008  04:00:13.468 ] [mpi_amr_comm_setup]: buffer_dim_send=4375321, buffer_dim_recv=3722705

Restarted:

  [ 11-17-2008  11:29:55.951 ] [gr_hgSolve]: gr_hgSolve: ite  4: norm(residual)/norm(src) = 1.445309E-06
  [ 11-17-2008  11:29:55.986 ] [mpi_amr_comm_setup]: buffer_dim_send=291565, buffer_dim_recv=218221
  [ 11-17-2008  11:29:56.550 ] [mpi_amr_comm_setup]: buffer_dim_send=232373, buffer_dim_recv=188625
  [ 11-17-2008  11:29:56.921 ] [mpi_amr_comm_setup]: buffer_dim_send=227937, buffer_dim_recv=188625
  [ 11-17-2008  11:29:57.454 ] [mpi_amr_comm_setup]: buffer_dim_send=227353, buffer_dim_recv=188625
  [ 11-17-2008  11:29:59.433 ] [gr_hgSolve]: gr_hgSolve: ite  5: norm(residual)/norm(src) = 1.673031E-07
  [ 11-17-2008  11:30:05.081 ] step: n=1029 t=7.484192E+05 dt=2.506123E+02
  [ 11-17-2008  11:30:05.436 ] [mpi_amr_comm_setup]: buffer_dim_send=5646649, buffer_dim_recv=4801841
  [ 11-17-2008  11:30:14.690 ] [mpi_amr_comm_setup]: buffer_dim_send=1310137, buffer_dim_recv=1107121
  [ 11-17-2008  11:30:20.366 ] [mpi_amr_comm_setup]: buffer_dim_send=4375321, buffer_dim_recv=3722705
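
For anyone wanting to locate the divergence point in full logs, the comparison above can be automated. The following is a rough Python sketch, not a FLASH tool: it assumes only the "step: n=... t=... dt=..." line format shown in the excerpts, and the function names parse_steps and first_divergence are hypothetical. Note that in the excerpts the ite 5 residuals already differ in the last printed digit (1.673032E-07 vs 1.673031E-07) just before the dt values part ways.

```python
import re

# Matches lines such as: step: n=1029 t=7.484192E+05 dt=2.140401E+02
STEP_RE = re.compile(r"step: n=(\d+) t=([0-9.E+-]+) dt=([0-9.E+-]+)")

def parse_steps(log_text):
    """Return {step number: (t, dt)} for every 'step:' line in the text."""
    steps = {}
    for m in STEP_RE.finditer(log_text):
        steps[int(m.group(1))] = (float(m.group(2)), float(m.group(3)))
    return steps

def first_divergence(old_log, new_log, rel_tol=0.0):
    """Return (n, dt_old, dt_new) for the first common step whose dt differs
    by more than rel_tol (relative), or None if no such step exists."""
    old, new = parse_steps(old_log), parse_steps(new_log)
    for n in sorted(old.keys() & new.keys()):
        dt_old, dt_new = old[n][1], new[n][1]
        if abs(dt_new - dt_old) > rel_tol * abs(dt_old):
            return n, dt_old, dt_new
    return None

# The two 'step:' lines from the excerpts above.
old_log = "[ 11-17-2008  03:59:58.194 ] step: n=1029 t=7.484192E+05 dt=2.140401E+02"
new_log = "[ 11-17-2008  11:30:05.081 ] step: n=1029 t=7.484192E+05 dt=2.506123E+02"

result = first_divergence(old_log, new_log)
print(result)
```

In practice one would pass the full contents of both log files rather than single lines; a small nonzero rel_tol can be used to ignore differences in the last printed digit.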





