[FLASH-BUGS] sedov_sph with Flash2.4 (II)
Tomasz Plewa
tomek at flash.uchicago.edu
Thu Oct 21 11:18:36 CDT 2004
Peter -
I am not in the trenches on this, but have you tried placing barriers
around some critical sections? Such experiments should be relatively
simple to make, since the error appears to be reproducible and you have
the initial data ready for testing.
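As a concrete illustration of what I mean, something along these lines in
update_grid_refinement.F90 could serve as a first experiment. This is only a
sketch, not a tested patch: the placement around mesh_prolong is a guess, and
the only assumption is the standard MPI Fortran binding (mpif.h, MPI_Barrier).

```fortran
      ! Debugging experiment (hypothetical): force all PEs to synchronize
      ! around the prolongation step to see whether the corruption of the
      ! new child blocks disappears.
      include 'mpif.h'
      integer :: ierr

      call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! all PEs enter prolongation together
      call mesh_prolong(MyPE, 1, nguard)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! all PEs finish before the guardcell fill
```

If the error goes away with the barriers in place, that would point to a race
in the communication pattern rather than to the prolongation arithmetic itself.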
Also, has anyone among the users tried this problem and observed anomalies?
Tomek
--
On Thu, Oct 21, 2004 at 04:14:22PM +0200, Peter Woitke wrote:
> Dear developers,
>
> -- see also the first submission, "sedov_sph with Flash2.4", by Erik-Jan Rijkhorst --
>
> We are still having trouble running 1D models (sedov, sedov_sph)
> on the parallel computers at our disposal (ASTER/TERAS, see
> http://www.sara.nl). In the meantime we have talked to the operators at
> these computing centers and have some more information. The problem is
> still unsolved, however.
>
> Description of the problem
> ==========================
> Error in mesh_prolong.F90 or a subordinate routine: after the prolongation,
> one or several new child blocks have wrong solnData; see the attached output
> from the files created before (5003) and after (5005) the call to
> mesh_prolong.F90 in source/mesh/amr/update_grid_refinement.F90.
>
> The following description refers to sedov_sph -1d with 8 PEs:
> ===========================================================
> - The same error occurs with FLASH2.3 and FLASH2.4
>
> - The error is reproducible
>
> - The same error occurs on ASTER (64-bit Linux) and TERAS (64-bit SGI Origin),
> but not on our local linux-cluster!
>
> - The same error occurs with either the mpt module or the mpich-1.2.5
> module loaded (different MPI implementations)
>
> - The error occurs only for -1d models
>
> - The error occurs only for certain numbers of PEs as described
> in our first submission
>
> - The error occurs only rarely: for this problem at timestep 1666, but
> at none of the refinements done before.
>
> Further observations/ideas
> ==========================
> The fact that this error occurs only for -1d models with many processors
> might suggest that the very fast MPI implementation is the problem:
> many MPI actions are issued in quick succession, each carrying only
> little data.
>
> The refinement step at 1666 is a complicated one: one block needs
> to be refined, but since it is surrounded by a row of less refined
> neighbours on the right hand side, 10 new blocks are created.
>
> We would be very happy about any comments from the developers.
>
>
> Kind regards,
>
> Peter Woitke & Erik-Jan Rijkhorst
>
> PS: details about setup and compilation
> =======================================
> ----------
> setup call
> ----------
> ./setup sedov -1d -auto, ./setup sedov_sph -1d -auto, respectively
>
> ----------------------------------
> Makefile.h: (for TERAS SGI-origin)
> ----------------------------------
> HDF5_PATH = /usr/local/opt/hdf5-1.4.4
> FCOMP = f90
> CCOMP = cc
> CPPCOMP = CC
> LINK = f90
> (MIPSpro SGI f90/cc compilers)
> FFLAGS_OPT = -64 -c -r8 -d8 -i4 -cpp -mips4 -O3 -Ofast=ip35
> (we also tried without optimisation - no difference)
> LFLAGS_OPT = -64 -r8 -d8 -i4 -IPA -o
> LIB_HDF5 = -L$(HDF5_PATH)/lib -lhdf5
> LIB_OPT = -lmpi
>
> --------
> run-call
> --------
> mpirun -np 8 flash2
>
> --------------------------
> slightly changed flash.par
> --------------------------
> lrefine_min = 1
> lrefine_max = 8
> basenm = "sedov_sph_"
> restart = .false.
> tplot = 0.001
> trstrt = 0.01
> nend = 1700
> tmax = 0.05
> plot_var_1 = "dens"
> plot_var_2 = "pres"
> plot_var_3 = "temp"
> plot_var_4 = "velx"
>
> --------------------------------------------------
> This is how we created the additional output files
> (in source/mesh/amr/update_grid_refinement.F90)
> --------------------------------------------------
> integer,save:: counter = 0
> ...
> call mark_grid_refinement()
> if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
> call plotfile(5000+counter, time)
> counter = counter + 1
> endif
> call mesh_refine_derefine()
> do block_no = 1,lnblocks
> call grid (block_no)
> end do
> new_child => dBaseTreePtrNewChild()
> if (conserved_var) then
> do block_no = 1, lnblocks
> if ( .not. new_child(block_no)) then
> solnData => dBaseGetDataPtrSingleBlock(block_no, GC)
> ...
> call convert_var_prim_to_cons( solnData(:,:,:,:) )
> call dBaseReleaseDataPtrSingleBlock(block_no, solnData)
> endif
> enddo
> endif
> if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
> call plotfile(5000+counter, time)
> counter = counter + 1
> endif
> call mesh_prolong (MyPE, 1, nguard)
> if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
> call plotfile(5000+counter, time)
> counter = counter + 1
> endif
> call mesh_guardcell (MyPE, 1, nguard, time, 1, 0)
> ...
>
> plotfiles 5000,5001,5002 are created during an exemplary working
> refinement step (nstep=1466),
> plotfiles 5003,5004,5005 are created during the refinement step that
> causes the error (nstep=1666).
>
> ----------------
> The stdout file:
> ----------------
> 1662 3.0518E-02 1.1371E-05 | 1.137E-05
> 1663 3.0541E-02 1.1372E-05 | 1.137E-05
> 1664 3.0564E-02 1.1374E-05 | 1.137E-05
> 1665 3.0587E-02 1.1375E-05 | 1.138E-05
> block to be refined: myPE=6, blockno=6 (print from amr_refine_derefine)
> *** Wrote output to sedov_sph_hdf5_plt_cnt_5003 ***
> PE 0 lnblocks 11 (print from amr_refine_derefine)
> PE 1 lnblocks 7
> PE 2 lnblocks 7
> PE 3 lnblocks 9
> PE 4 lnblocks 6
> PE 5 lnblocks 7
> PE 6 lnblocks 7
> PE 7 lnblocks 5
> min_blocks 6 max_blocks 12 tot_blocks 69
> *** Wrote output to sedov_sph_hdf5_plt_cnt_5004 ***
> *** Wrote output to sedov_sph_hdf5_plt_cnt_5005 ***
>
--
Thu, 11:18 CDT (16:18 GMT), Oct-21-2004
_______________________________________________________________________________
Tomasz Plewa www: flash.uchicago.edu
Computational Physics and Validation Group email: tomek at uchicago.edu
The ASC FLASH Center, The University of Chicago phone: 773.834.3227
5640 South Ellis, RI 475, Chicago, IL 60637 fax: 773.834.3230
_______________________________________________________________________________