[FLASH-USERS] restarting FLASH
mateuszr at umich.edu
Wed Aug 27 19:33:09 EDT 2008
Hi Josef/Paul,
thanks for your suggestions. I did more experiments with restarting
the code and here are my observations:
1) serial IO restart (compiled with bounds check and traceback) shows
that there is indeed a problem in the part of the code that Josef
described in his e-mail. I included that update in
IO/IOMain/hdf5/serial/PM/io_readData.F90
as he suggested. This nicely removed the segmentation fault but
unfortunately led to the following error message (and to the crashing
of the code):
INFO: Grid_fillGuardCells is ignoring masking.
[flash_convert_cc_hook] PE= 92, ivar= 3, why=2
Trying to convert non-zero mass-specific variable to per-volume form, but dens is zero!
The checkpoint file from which I am attempting to restart the
simulation is "valid" and looks as expected when inspected w/ IDL.
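Since the error complains about zero density, one way to double-check the checkpoint outside IDL is to scan the density block by block. The following is only a minimal sketch. It assumes the density has already been loaded as nested lists indexed as dens[block][k][j][i] (for example by calling .tolist() on an array read from the checkpoint's "dens" dataset; that dataset name and layout are assumptions here). The function names are hypothetical and not part of FLASH.

```python
# Sketch only: scan per-block density data for cells at or below a floor.
# Assumption: `dens` is a nested list indexed as dens[block][k][j][i];
# how it is loaded from the checkpoint file is not shown here.

def iter_cells(block):
    """Yield every scalar in an arbitrarily nested list/tuple of cell values."""
    for item in block:
        if isinstance(item, (list, tuple)):
            yield from iter_cells(item)
        else:
            yield item


def find_bad_density_blocks(dens, floor=0.0):
    """Return {block_index: count} for blocks containing cells with dens <= floor."""
    bad = {}
    for b, block in enumerate(dens):
        n = sum(1 for value in iter_cells(block) if value <= floor)
        if n:
            bad[b] = n
    return bad


if __name__ == "__main__":
    # Two tiny 2x2x2 "blocks"; the second one has a single zero-density cell.
    blocks = [
        [[[1.0, 1.2], [0.9, 1.1]], [[1.0, 1.0], [1.3, 0.8]]],
        [[[1.0, 0.0], [0.9, 1.1]], [[1.0, 1.0], [1.3, 0.8]]],
    ]
    print(find_bad_density_blocks(blocks))
```

An empty result would suggest that the zeros are not in the file itself but appear while the data is being read back in or distributed.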
2) when the code is run in the parallel IO mode, stopped and then
restarted, it effectively hangs at the step n=1 without crashing
(i.e., no output is produced and the batch job keeps its "R" status). This
does not depend on the amount of allocated memory or the number of
processors.
This is all a bit puzzling. If anybody has any thoughts on this then I
would be grateful for comments/suggestions.
thanks,
Mateusz
Quoting Josef Stöckl <josef.stoeckl at uibk.ac.at>:
> Hi Mateusz,
>
> There is a distinct restart bug in the serial HDF5 IO unit, which I described
> a few months back in the flash-bugs mailing list. Basically in a 3D problem
> the wrong buffer variable gets used (most likely due to copy-and-paste). I
> also posted a fix, which consists of modifying one line in the file
> IO/IOMain/hdf5/serial/PM/io_readData.F90:
>
> if(NDIM .gt. 2) then
>
> allocate(faceZBuf(NUNK_VARS, NXB, NYB, NZB+1, localNumBlocks))
> - call MPI_RECV(unk(i,:,:,:,1:localNumBlocks), &
> + call MPI_RECV(faceZBuf(i,:,:,:,1:localNumBlocks), &
> NXB*NYB*(NZB+1)*localNumBlocks, &
> FLASH_REAL, MASTER_PE, &
> 9+i+NUNK_VARS+(NFACE_VARS*2), &
> MPI_COMM_WORLD, status, ierr)
> facevarz(i,io_ilo:io_ihi, io_jlo:io_jhi, &
> io_klo:io_khi+1,1:localNumBlocks) = &
> faceZBuf(i,1:NXB,1:NYB,1:NZB+1,1:localNumBlocks)
> deallocate(faceZBuf)
>
> end if
>
> (this is a unified patch-like description)
>
> I hope this helps you!
>
> Best regards,
> Josef
>
>
> -----Original Message-----
> From: flash-users-bounces at flash.uchicago.edu
> [mailto:flash-users-bounces at flash.uchicago.edu] On Behalf Of
> mateuszr at umich.edu
> Sent: Saturday, 23 August 2008 20:53
> To: flash-users at flash.uchicago.edu
> Subject: [FLASH-USERS] restarting FLASH
>
>
> Hi all,
>
> I am having trouble restarting the code. More specifically, I am
> getting a segmentation fault when I try to restart from a valid
> checkpoint file. The details are enclosed below. I would be very
> grateful for some clues.
>
> thanks,
> Mateusz
>
>
> [mateuszr at galaxy ~/FLASH]$ more out.txt
>
> [io_readData] Opening test_hdf5_chk_0011 for restart
> rank 1 in job 2 galaxy001.astro.lsa.umich.edu_38345 caused
> collective abort of all ranks
> exit status of rank 1: killed by signal 9
>
> [mateuszr at galaxy ~/FLASH]$ more flash_run.err
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> flash3 000000000050CC48 Unknown Unknown Unknown
> flash3 000000000043EEBA Unknown Unknown Unknown
> flash3 000000000043C359 Unknown Unknown Unknown
> flash3 0000000000410513 Unknown Unknown Unknown
> flash3 0000000000415C19 Unknown Unknown Unknown
> flash3 0000000000406982 Unknown Unknown Unknown
> libc.so.6 00000035AFE1D8A4 Unknown Unknown Unknown
> flash3 00000000004068A9 Unknown Unknown Unknown
> touch: cannot touch `/scratch///.keep': Permission denied