[FLASH-USERS] Big Problem??

Seyit Hocuk seyit at astro.rug.nl
Mon Aug 18 10:38:26 EDT 2008


Hi Chris, and others,

Thank you, that solved the two errors for the standard jeans and sedov 
setups, by the way.  However, those were unrelated to our problem of 
differing results. Besides, we found another mistake in the ref_marking 
routine of the jeans setup: a "save" was missing in the declarations of 
refine_cutoff, derefine_cutoff and the filter.
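
For anyone hitting the same thing, here is a minimal sketch of why the 
missing "save" matters. The routine and variable names are invented for 
illustration and are not the actual FLASH2 ref_marking code; the only 
assumption is that the cutoffs are set once on the first call and reused 
on later calls.

```fortran
program save_demo
  implicit none
  integer :: i

  do i = 1, 3
     call mark_refinement_demo()
  end do

contains

  subroutine mark_refinement_demo()
    ! Hypothetical stand-in for a refinement-marking routine.
    ! "save" keeps these values between calls; without it the Fortran
    ! standard leaves local variables undefined on re-entry, so what
    ! they contain depends on the compiler and its options.
    logical, save :: first_call = .true.
    real, save    :: refine_cutoff_demo, derefine_cutoff_demo

    if (first_call) then
       ! One-time setup, e.g. where runtime parameters would be read.
       refine_cutoff_demo   = 0.8
       derefine_cutoff_demo = 0.2
       first_call = .false.
    end if

    print *, 'cutoffs:', refine_cutoff_demo, derefine_cutoff_demo
  end subroutine mark_refinement_demo

end program save_demo
```

Without "save" on the two cutoffs, the second and third calls would read 
values that the standard does not define, which is the kind of thing 
that can differ from one compiler version to the next.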

Anyway, after these fixes no other errors showed up when debugging, so 
the code is a-ok for our own setup too. No uninitialized parameters. We 
downloaded the latest ifort compilers (10.1.012 / 015 / 017) and tried 
each of them. The difference between ifort 8.0 and 10.1 seems huge, even 
on the same machine! Refinement is very different even for the normal 
jeans setup. The differences became very small when we reduced the 
boxsize, just to check. Normally I use a boxsize of 8.0E19; when we tried 
8.0E9 (mind you, it's 10 orders of magnitude less), the differences were 
minimal.
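
As a side note on the two box sizes, the sketch below prints how coarse 
floating-point numbers are near 8.0E19 and near 8.0E9, and how the square 
of the larger box size compares with the single-precision range. It is 
not a diagnosis of the ifort 8.0 behaviour; the variable names are made 
up, and it assumes that default "real" is single precision (no -r8 or 
similar flag).

```fortran
program magnitude_demo
  implicit none
  ! Assumption: kind(1.0) is single and kind(1.0d0) is double precision.
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  real(dp) :: big, small
  real(sp) :: big_sp

  big   = 8.0e19_dp
  small = 8.0e9_dp

  ! Gap between adjacent representable doubles near each box size.
  print *, 'spacing near 8.0E19:', spacing(big)
  print *, 'spacing near 8.0E9 :', spacing(small)

  ! Relative to the value itself the gap is of the same order.
  print *, 'relative spacing   :', spacing(big)/big, spacing(small)/small

  ! Largest single-precision number versus the squared box size.
  big_sp = 8.0e19_sp
  print *, 'huge(single)       :', huge(big_sp)
  print *, 'boxsize**2 (double):', big**2
end program magnitude_demo
```

The absolute gap grows with the magnitude of the number, so any 
intermediate quantity that is computed with less precision or range by 
one compiler than by another has more room to differ at the larger box 
size.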

Now comparing the two computers, one 32-bit and one 64-bit but both with 
the same or similar compilers, gives only very small differences. This 
means, I guess, that there was something wrong with ifort 8.0 (for large 
numbers perhaps?). If so, that annoying thing cost us many, many hours, 
days, weeks and headaches of searching. The slight differences that 
still exist can hopefully be attributed to the difference in the number 
of bits. I hope you guys can answer that better.


Thanks,
Seyit

ps: don't say switch to FLASH3.0, because I am trying.




Chris Daley wrote:
> Hi Seyit,
>
> The compiler flag is doing its job - we get a run-time error when an
> uninitialised variable is used.
>
> 1.  I can confirm that the abort from perfmon.F90 is an error.  Notice
> that the subroutine argument is "check_ptnum", but the condition in the
> "if" statement evaluates a variable named "checkpt_num".  This
> emphasises the importance of "implicit none".
>
> 2.  The abort from poisson_mg_relax.F90 stems from a situation which will
> not normally cause problems in a simulation.  This is because the
> "error" variable is part of the following logical expression.
>
> done = (iter == nsmooth) .or. &
>        ((iter > iterating_to_convergence_limit) .and. (error > 0.))
>
> i.e. (iter > iterating_to_convergence_limit) .and. (error > 0.).  So we
> need both conditions to be .true. to obtain a .true. in the cumulative
> expression. In the poisson_mg_relax.F90 source, "error" is only ever
> initialised when "iter" > "iterating_to_convergence_limit", so the
> times when "error" is not initialised will generally not be an issue.
>
> In pursuit of robust code you should correct the code as follows:
>
> 1.  Rename "checkpt_num" variable to "check_ptnum"
> 2.  Initialise "error" to 0.0 after the variable declaration.
>
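
A minimal sketch of the second fix in the list above, for reference. The 
loop and the residual formula are placeholders and not the FLASH2 
poisson_mg_relax.F90 code; only the "done" expression and the variable 
names in it are taken from the message. "implicit none" is included 
because it makes the compiler reject a misspelled, undeclared name such 
as the "checkpt_num" / "check_ptnum" mix-up.

```fortran
program uninit_demo
  implicit none
  integer, parameter :: nsmooth = 5
  integer, parameter :: iterating_to_convergence_limit = 3
  integer :: iter
  real    :: error
  logical :: done

  ! Give "error" a defined value before it can be read.
  error = 0.0

  iter = 0
  done = .false.
  do while (.not. done)
     iter = iter + 1
     if (iter > iterating_to_convergence_limit) then
        error = 1.0 / real(iter)   ! placeholder for a residual norm
     end if
     ! Fortran does not guarantee short-circuit evaluation of .and.,
     ! so "error" may be read here even while iter is below the limit.
     done = (iter == nsmooth) .or. &
            ((iter > iterating_to_convergence_limit) .and. (error > 0.))
  end do

  print *, 'stopped at iter =', iter
end program uninit_demo
```

Because both operands of .and. may be evaluated, a run-time check for 
uninitialised variables can fire on "error" even when the result of the 
expression does not depend on it.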
>
> If you re-compile and run you will hopefully get a similar kind
> of abort during execution of your simulation-specific code.  The error
> can then be resolved using similar steps as above.  You must keep the
> -g flag as it gives us line-level information in the stack traces.
>
> Finally, I would recommend that you use FLASH3 if it
> supports the physics that you need.  It is much easier for
> us to answer your FLASH3 questions rather than FLASH2 questions,
> and we would rather spend our time making FLASH3 a better
> application.
>
> Chris
>
>
>
>
>
> Seyit Hocuk wrote:
>> Hi Chris, Nathan, Carlo,
>>
>> It has been a while now, but finally I had time to check some stuff. 
>> Like Chris said I tried compiling with -CU (check uninit) and also 
>> with -ftrapuv. The error with -CU is shown here below. However, the 
>> same error is showing when I use normal jeans setup and a different 
>> error is generated with Sedov test problem. I think there are many 
>> undeclared variables or subroutines anyway. When I use -ftrapuv, I 
>> get segmentation faults. These debug options are not much use 
>> unfortunately.
>>
>> By the way, what is the use of -g when debugging? It seems like it is 
>> doing nothing.
>>
>>
>>
>> * JEANS SETUP
>>
>> perturbation is unstable with growth time    8776137041619.21    
>> forrtl: severe (193): Run-Time Check Failure. The variable 'poisson_mg_relax_$ERROR' is being used without being defined
>> Image              PC                Routine            Line        Source
>> flash2             00000000005B70FE  Unknown               Unknown  Unknown
>> flash2             00000000005B62FE  Unknown               Unknown  Unknown
>> flash2             000000000056EB56  Unknown               Unknown  Unknown
>> flash2             0000000000536D51  Unknown               Unknown  Unknown
>> flash2             00000000005383C6  Unknown               Unknown  Unknown
>> flash2             00000000004D31E4  poisson_mg_relax_         315  poisson_mg_relax.F90
>> flash2             00000000004D4DE9  poisson_mg_solve_          38  poisson_mg_solve.F90
>> flash2             00000000004CA3C4  mg_cycle_                  82  mg_cycle.F90
>> flash2             00000000004CEEC8  multigrid_                150  multigrid.F90
>> flash2             00000000004D249A  poisson_                   87  poisson.F90
>> flash2             0000000000430CE6  modulegravpotenti         139  GravPotentialAllBlocks.F90
>> flash2             000000000042AF1B  init_from_scratch         257  init_from_scratch.F90
>> flash2             000000000041D45A  init_flash_               324  init_flash.F90
>> flash2             000000000041C5BF  MAIN__                     62  flash.F90
>> flash2             0000000000405362  Unknown               Unknown  Unknown
>> libc.so.6          00007F7DBB0131C4  Unknown               Unknown  Unknown
>> flash2             00000000004052A9  Unknown               Unknown  Unknown
>>
>>
>>
>> * SEDOV EXPLOSION SETUP
>>
>>
>> [CHECKPOINT_WR] NOTE: will send          710  blocks per message.
>> [CHECKPOINT_WR] Writing checkpoint file sedov_hdf5_chk_0000
>> Progress:  |
>> forrtl: severe (193): Run-Time Check Failure. The variable 'perfmon_mp_log_timers_$CHECKPT_NUM' is being used without being defined
>> Image              PC                Routine            Line        Source
>> flash2             00000000005A1856  Unknown               Unknown  Unknown
>> flash2             00000000005A0A56  Unknown               Unknown  Unknown
>> flash2             000000000055A646  Unknown               Unknown  Unknown
>> flash2             0000000000522B81  Unknown               Unknown  Unknown
>> flash2             00000000005241F6  Unknown               Unknown  Unknown
>> flash2             00000000004CB301  perfmon_mp_log_ti        1067  perfmon.F90
>> flash2             0000000000465F75  checkpoint_wr_            701  checkpoint_wr.F90
>> flash2             000000000045E2F6  output_initial_           189  output_initial.F90
>> flash2             000000000041DB09  init_flash_               335  init_flash.F90
>> flash2             000000000041CC33  MAIN__                     62  flash.F90
>> flash2             00000000004051E2  Unknown               Unknown  Unknown
>> libc.so.6          00007F0BCE1FF1C4  Unknown               Unknown  Unknown
>> flash2             0000000000405129  Unknown               Unknown  Unknown
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Chris Daley wrote:
>>> Hi Seyit,
>>>
>>> I fully agree with Carlo's recommendation - Valgrind is an excellent
>>> tool.  However, before resorting to such a powerful tool, it may be
>>> worthwhile using your compiler to detect uninitialised data.  I notice
>>> that you have the intel compiler on two of your computing platforms.
>>>
>>> Try adding "-check uninit" and "-traceback" to your Fortran
>>> (debug) compilation options.  This will check if an uninitialised
>>> variable is used in a calculation, and will generate a runtime error
>>> if it is used.
>>>
>>> I've just run a mini test in which I used a variable in my
>>> Simulation_initBlock.F90 that is never initialised.  Here is the output:
>>>
>>> forrtl: severe (193): Run-Time Check Failure. The variable 'simulation_initblock_$PP' is being used without being defined
>>> Image              PC                Routine            Line        Source
>>> flash3             00000000006B010B  Unknown               Unknown  Unknown
>>> flash3             00000000006AE446  Unknown               Unknown  Unknown
>>> flash3             00000000006878DE  Unknown               Unknown  Unknown
>>> flash3             000000000065CEB2  Unknown               Unknown  Unknown
>>> flash3             000000000065D147  Unknown               Unknown  Unknown
>>> flash3             000000000049D3DC  simulation_initbl          88  Simulation_initBlock.F90
>>> flash3             000000000044F05B  Unknown               Unknown  Unknown
>>> flash3             000000000040E152  Unknown               Unknown  Unknown
>>> flash3             000000000041211D  Unknown               Unknown  Unknown
>>> flash3             0000000000404A6A  Unknown               Unknown  Unknown
>>> libc.so.6          000000355501C3FB  Unknown               Unknown  Unknown
>>> flash3             00000000004049AA  Unknown               Unknown  Unknown
>>> p0_31573:  p4_error: interrupt SIGx: 13
>>>
>>> Remember to compile your code in "debug" mode, i.e. to
>>> include the -g compilation flag and no optimisations.
>>>
>>> You may also want to look at the option "-ftrapuv" which initialises
>>> local stack variables to "unusual values".  Further information about
>>> all of these options can be found in the intel man page.
>>>
>>> Regards,
>>> Chris
>>>
>>>
>>> Carlo Graziani wrote:
>>>> Hi Seyit.
>>>>
>>>> Nathan's suggestion of investigating the use of uninitialized
>>>> variables and erratically allocated/deallocated memory is very
>>>> sensible for tracking down a problem that manifests itself as
>>>> unpredictable behavior.
>>>>
>>>> There is actually an open-source tool for doing this called valgrind.
>>>> It may already be installed on your local linux systems, and if not it
>>>> is easy enough to obtain.
>>>>
>>>> One runs a program under valgrind very simply (the documentation gives
>>>> more options):
>>>>
>>>> Prompt> valgrind <program-name> <program arguments>
>>>>
>>>> Then one sits back and digests the potentially-voluminous output.
>>>>
>>>> Valgrind will flag any access to uninitialized memory and
>>>> memory-management screw-ups.
>>>>
>>>> Caveats are: (1) It's slow;
>>>>
>>>> (2) You'd be amazed at how much valgrind finds distasteful in system
>>>> libraries.  You'll probably have to filter away a bunch of
>>>> uninteresting warnings about libc/mpi/hdf5 and so on (which are
>>>> probably harmless, and which you can't do much about anyway).
>>>> Valgrind has some facilities for suppressing certain types of
>>>> warnings, which you can use to cut down the noise.
>>>>
>>>> If you can make a small version of the problem, running on one
>>>> processor, that exhibits the erratic behavior, this would probably be
>>>> an ideal case to feed to valgrind.  There's some support for parallel
>>>> debugging, but you'd probably have to spend some quality time with
>>>> documentation and haunt some other mailing lists to get that running.
>>>>
>>>> Cheers,
>>>>
>>>> Carlo
>>>>
>>>> Nathan Hearn wrote:
>>>>> Hi Seyit,
>>>>>
>>>>>     An uninitialized variable is one that is declared (specified as
>>>>> integer, real, etc.), but not assigned a value.  Thus, an
>>>>> uninitialized variable usually has whatever value was in its memory
>>>>> location before it was declared.  (It could be a random number,
>>>>> "infinity," or just garbage.)  If this variable gets used before a
>>>>> value is assigned to it, strange behavior may result (which would be
>>>>> very compiler- and architecture-specific).  If you are using
>>>>> uninitialized pointer or allocatable variables, the effects can be
>>>>> quite drastic and hard to identify.
>>>>>
>>>>>     Generally speaking, it is a good idea to assign a value to every
>>>>> variable soon after it is declared, even if it is only a temporary
>>>>> value that is not actually used.  (As I recall, there is a way to
>>>>> assign a null value to pointers, which is also a very useful
>>>>> practice.)
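
A minimal sketch of the pointer advice above: a pointer can be given a 
null value with "=> null()" at declaration (or with "nullify" later), 
after which "associated" can be used to test it safely. The array name 
and size are made up for illustration.

```fortran
program null_pointer_demo
  implicit none
  ! Starts out disassociated rather than in an undefined state.
  real, pointer :: work(:) => null()

  if (.not. associated(work)) then
     allocate(work(10))
     work = 0.0
  end if
  print *, 'size of work:', size(work)

  ! Deallocating through the pointer leaves it disassociated again.
  deallocate(work)
  print *, 'associated after deallocate?', associated(work)
end program null_pointer_demo
```

Calling "associated" on a pointer that was never nullified or assigned 
is not defined by the standard, which is why the initial null value is 
useful.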
>>>>>
>>>>>
>>>>> - Nathan
>>>>>
>>>>>
>>>>> On 8/4/08, Seyit Hocuk <seyit at astro.rug.nl> wrote:
>>>>>> Hi Paul, hi Nathan,
>>>>>>
>>>>>> First of all; using --with-default-api-version=v16 when configuring
>>>>>> hdf5-1.8.1 works fine. Thanks for that Paul.
>>>>>>
>>>>>> Nathan, if you mean by uninitialized that the types are not defined
>>>>>> (like in config file REAL or INTEGER or whatever), then no because I
>>>>>> define them all. But if you mean I have included modules or
>>>>>> parameters which I don't use, that's correct and I am no expert in
>>>>>> programming so it might indeed be good to check this.
>>>>>>
>>>>>> Greetz,
>>>>>> Seyit
>>>>
>>>>
>>>
>



