[FLASH-USERS] restarting FLASH

mateuszr at umich.edu mateuszr at umich.edu
Wed Aug 27 19:33:09 EDT 2008



   Hi Josef/Paul,

thanks for your suggestions. I ran more experiments with restarting
the code, and here are my observations:

1) A serial IO restart (compiled with bounds checking and traceback) shows
that there is indeed a problem in the part of the code that Josef
described in his e-mail. I applied his fix to
IO/IOMain/hdf5/serial/PM/io_readData.F90
as he suggested. This removed the segmentation fault but
unfortunately led to the following error message, after which the
code crashed:

  INFO: Grid_fillGuardCells is ignoring masking.
[flash_convert_cc_hook] PE=     92, ivar=  3, why=2
  Trying to convert non-zero mass-specific variable to per-volume form, but dens
  is zero!

The checkpoint file from which I am attempting to restart the
simulation is "valid" and looks as expected when inspected with IDL.
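
One way to double-check the checkpoint data against the condition named
in the error message is sketched below. This is only an illustration in
Python, not FLASH code: the function name find_bad_cells and the sample
values are hypothetical, and it assumes that the density and one
mass-specific variable have already been read from the checkpoint into
flat lists of equal length (the reading step is left out). It reports
the cells where the density is zero but the mass-specific variable is not.

```python
def find_bad_cells(dens, spec, tol=0.0):
    """Return indices of cells whose density is zero (|dens| <= tol)
    while the mass-specific variable is non-zero.

    dens, spec: flat sequences of floats of equal length (hypothetical
    inputs; loading them from the checkpoint file is not shown here).
    """
    if len(dens) != len(spec):
        raise ValueError("dens and spec must have the same length")
    return [
        i
        for i, (d, s) in enumerate(zip(dens, spec))
        if abs(d) <= tol and s != 0.0
    ]


# Made-up sample data: only cell 3 has zero density with a non-zero value.
dens = [1.0, 0.0, 2.5, 0.0]
spec = [0.3, 0.0, 0.1, 0.7]
bad = find_bad_cells(dens, spec)
print(bad)
```

An empty result for every mass-specific variable would suggest that the
file itself is fine and that the zero density appears only after the data
has been read in.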

2) When the code is run in parallel IO mode, stopped, and then
restarted, it hangs at step n=1 without crashing
(i.e., no output is produced and the batch job still has "R" status). This
does not depend on the amount of allocated memory or the number of
processors.

This is all a bit puzzling. If anybody has any thoughts on this then I  
would be grateful for comments/suggestions.

   thanks,
    Mateusz







Quoting Josef Stöckl <josef.stoeckl at uibk.ac.at>:

> Hi Mateusz,
>
> There is a distinct restart bug in the serial HDF5 IO unit, which I described
> a few months back in the flash-bugs mailing list. Basically in a 3D problem
> the wrong buffer variable gets used (most likely due to copy-and-paste). I
> also posted a fix, which consists of modifying one line in the file
> IO/IOMain/hdf5/serial/PM/io_readData.F90:
>
>     if(NDIM .gt. 2) then
>
>           allocate(faceZBuf(NUNK_VARS, NXB, NYB, NZB+1, localNumBlocks))
> -          call MPI_RECV(unk(i,:,:,:,1:localNumBlocks), &
> +          call MPI_RECV(faceZBuf(i,:,:,:,1:localNumBlocks), &
>                NXB*NYB*(NZB+1)*localNumBlocks, &
>                FLASH_REAL, MASTER_PE, &
>        9+i+NUNK_VARS+(NFACE_VARS*2), &
>                MPI_COMM_WORLD, status, ierr)
>           facevarz(i,io_ilo:io_ihi, io_jlo:io_jhi, io_klo:io_khi+1,1:localNumBlocks) = &
>           faceZBuf(i,1:NXB,1:NYB,1:NZB+1,1:localNumBlocks)
>       deallocate(faceZBuf)
>
>     end if
>
> (this is a unified patch-like description)
>
> I hope this helps you!
>
> Best regards,
> Josef
>
>
> -----Original Message-----
> From: flash-users-bounces at flash.uchicago.edu
> [mailto:flash-users-bounces at flash.uchicago.edu] On Behalf Of
> mateuszr at umich.edu
> Sent: Saturday, 23 August 2008 20:53
> To: flash-users at flash.uchicago.edu
> Subject: [FLASH-USERS] restarting FLASH
>
>
>   Hi all,
>
> I am having trouble restarting the code. More specifically, I am
> getting a segmentation fault when I try to restart from a valid
> checkpoint file. The details are enclosed below. I would be very
> grateful for some clues.
>
>   thanks,
>     Mateusz
>
>
> [mateuszr at galaxy ~/FLASH]$ more out.txt
>
>  [io_readData] Opening test_hdf5_chk_0011 for restart
> rank 1 in job 2  galaxy001.astro.lsa.umich.edu_38345   caused
> collective abort of all ranks
>   exit status of rank 1: killed by signal 9
>
> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine            Line        Source
> flash3             000000000050CC48  Unknown               Unknown  Unknown
> flash3             000000000043EEBA  Unknown               Unknown  Unknown
> flash3             000000000043C359  Unknown               Unknown  Unknown
> flash3             0000000000410513  Unknown               Unknown  Unknown
> flash3             0000000000415C19  Unknown               Unknown  Unknown
> flash3             0000000000406982  Unknown               Unknown  Unknown
> libc.so.6          00000035AFE1D8A4  Unknown               Unknown  Unknown
> flash3             00000000004068A9  Unknown               Unknown  Unknown
> touch: cannot touch `/scratch///.keep': Permission denied
>
>
>
>
>
>
>



