[FLASH-BUGS] sedov_sph with Flash2.4 (II)

Tomasz Plewa tomek at flash.uchicago.edu
Thu Oct 21 11:18:36 CDT 2004


Peter -

I am not in the trenches on this, but have you tried to place barriers
around some critical sections? Such experiments should be relatively
simple to make, as it seems the error is reproducible and you have
the initial data ready for testing.
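
A minimal sketch of such a barrier experiment, assuming the standard MPI
Fortran bindings from mpif.h. The routine suspect_step is a hypothetical
stand-in for whatever call is under suspicion (mesh_prolong in the report
below); it is not a FLASH routine, and the placement of the barriers is only
one possible choice.

```fortran
! Sketch only: brackets a stand-in for the suspect call with MPI barriers.
! suspect_step is a hypothetical placeholder, not part of FLASH.
program barrier_probe
  implicit none
  include 'mpif.h'
  integer :: ierr, my_pe, num_pes

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_pe, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, num_pes, ierr)

  ! Every PE reaches this point before any PE enters the suspect step.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  call suspect_step(my_pe)
  ! No PE continues until every PE has completed the suspect step.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  if (my_pe == 0) print *, 'barrier probe finished on', num_pes, 'PEs'
  call MPI_Finalize(ierr)

contains

  subroutine suspect_step(pe)
    integer, intent(in) :: pe
    ! Placeholder for the call under test.
    print *, 'PE', pe, 'in suspect step'
  end subroutine suspect_step

end program barrier_probe
```

If the wrong child-block data no longer appears once the real call is
bracketed this way, that would hint at a timing or message-ordering problem
rather than an error in the interpolation itself.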

Also, has any user tried this problem and observed anomalies?

Tomek
--
On Thu, Oct 21, 2004 at 04:14:22PM +0200, Peter Woitke wrote:
> Dear developers,
> 
> -- see also the first submission "sedov_sph with Flash2.4" by Erik-Jan Rijkhorst --
> 
> We are still having trouble running 1D models (sedov, sedov_sph)
> on the parallel computers at our disposal (ASTER/TERAS, see 
> http://www.sara.nl). Meanwhile we talked to the operators at this
> computing center and have some more information. The problem is
> still unsolved, however.
> 
> Description of the problem
> ==========================
> Error in mesh_prolong.F90 or subordinate routines: after the prolongation, 
> one or several new child blocks have wrong solndata, see attached output 
> from files created before (5003) and after (5005) the call of 
> mesh_prolong.F90 in source/mesh/amr/update_grid_refinement.F90.
> 
> Further description is related to sedov_sph -1d with 8 PEs:
> ===========================================================
> - The same error occurs with FLASH2.3 and FLASH2.4
> 
> - The error is reproducible
> 
> - The same error occurs on ASTER (64-bit Linux) and TERAS (64-bit SGI Origin),
>    but not on our local Linux cluster!
> 
> - The same error occurs with mpt-module xor mpich-1.2.5-module loaded
>    (different MPI implementations)
> 
> - The error occurs only for -1d models
> 
> - The error occurs only for certain numbers of PEs as described
>    in our first submission
> 
> - The error occurs rarely: for this problem at timestep 1666, but not
>    at any of the refinements done before.
> 
> Further observations/ideas
> ==========================
> The error occurs only for -1d models with lots of processors, which
> might suggest that the super-fast MPI implementation is a problem:
> many MPI actions are taken in quick succession, each with only little data.
> 
> The refinement step at 1666 is a complicated one: one block needs
> to be refined, but since it is surrounded by a row of less refined 
> neighbours on the right hand side, 10 new blocks are created.
> 
> We would be very grateful for any comments from the developers.
> 
> 
> Kind regards,
> 
> Peter Woitke  &  Erik-Jan Rijkhorst
> 
> 
> 
> 
> 
> PS: details about setup and compilation
> =======================================
> ----------
> setup call
> ----------
> ./setup sedov -1d -auto, ./setup sedov_sph -1d -auto, respectively
> 
> ----------------------------------
> Makefile.h: (for TERAS SGI-origin)
> ----------------------------------
> HDF5_PATH = /usr/local/opt/hdf5-1.4.4
> FCOMP   = f90
> CCOMP   = cc
> CPPCOMP = CC
> LINK    = f90
>    (MIPSpro SGI f90/cc compilers)
> FFLAGS_OPT  = -64 -c -r8 -d8 -i4 -cpp -mips4 -O3 -Ofast=ip35
>    (we also tried without optimisation - no difference)
> LFLAGS_OPT  = -64 -r8 -d8 -i4 -IPA -o
> LIB_HDF5  = -L$(HDF5_PATH)/lib -lhdf5
> LIB_OPT   = -lmpi
> 
> --------
> run-call
> --------
> mpirun -np 8 flash2
> 
> --------------------------
> slightly changed flash.par
> --------------------------
> lrefine_min     = 1
> lrefine_max     = 8
> basenm          = "sedov_sph_"
> restart         = .false.
> tplot           = 0.001
> trstrt          = 0.01
> nend            = 1700
> tmax            = 0.05
> plot_var_1      = "dens"
> plot_var_2      = "pres"
> plot_var_3      = "temp"
> plot_var_4      = "velx"
> 
> --------------------------------------------------
> This is how we created the additional output files
> (in source/mesh/amr/update_grid_refinement.F90)
> --------------------------------------------------
>          integer,save:: counter = 0
>          ...
>          call mark_grid_refinement()
>          if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
>            call plotfile(5000+counter, time)
>            counter = counter + 1
>          endif
>          call mesh_refine_derefine()
>          do block_no = 1,lnblocks
>             call grid (block_no)
>          end do
>          new_child =>  dBaseTreePtrNewChild()
>          if (conserved_var) then
>             do block_no = 1, lnblocks
>                if ( .not. new_child(block_no)) then
>                   solnData => dBaseGetDataPtrSingleBlock(block_no, GC)
>                   ...
>                   call convert_var_prim_to_cons( solnData(:,:,:,:) )
>                   call dBaseReleaseDataPtrSingleBlock(block_no, solnData)
>                endif
>             enddo
>          endif
>          if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
>            call plotfile(5000+counter, time)
>            counter = counter + 1
>          endif
>          call mesh_prolong (MyPE, 1, nguard)
>          if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
>            call plotfile(5000+counter, time)
>            counter = counter + 1
>          endif
>          call mesh_guardcell (MyPE, 1, nguard, time, 1, 0)
>          ...
> 
> plotfiles 5000,5001,5002 are created during an example of a working 
> refinement step (nstep=1466),
> plotfiles 5003,5004,5005 are created during the refinement step that 
> causes the error (nstep=1666).
> 
> ----------------
> The stdout file:
> ----------------
>      1662 3.0518E-02 1.1371E-05 |  1.137E-05
>      1663 3.0541E-02 1.1372E-05 |  1.137E-05
>      1664 3.0564E-02 1.1374E-05 |  1.137E-05
>      1665 3.0587E-02 1.1375E-05 |  1.138E-05
>   block to be refined:  myPE=6, blockno=6  (print from amr_refine_derefine)
>   *** Wrote output to sedov_sph_hdf5_plt_cnt_5003 ***
>   PE  0  lnblocks  11                      (print from amr_refine_derefine)
>   PE  1  lnblocks  7
>   PE  2  lnblocks  7
>   PE  3  lnblocks  9
>   PE  4  lnblocks  6
>   PE  5  lnblocks  7
>   PE  6  lnblocks  7
>   PE  7  lnblocks  5
>   min_blocks          6 max_blocks         12 tot_blocks         69
>   *** Wrote output to sedov_sph_hdf5_plt_cnt_5004 ***
>   *** Wrote output to sedov_sph_hdf5_plt_cnt_5005 ***
> 




-- 
Thu, 11:18 CDT (16:18 GMT), Oct-21-2004
_______________________________________________________________________________

   Tomasz Plewa                                      www:   flash.uchicago.edu
   Computational Physics and Validation Group        email: tomek at uchicago.edu
   The ASC FLASH Center, The University of Chicago   phone: 773.834.3227
   5640 South Ellis, RI 475, Chicago, IL 60637       fax:   773.834.3230
_______________________________________________________________________________


