[FLASH-USERS] Strange Behaviour Running FLASH
Alexander Sheardown
A.Sheardown at 2011.hull.ac.uk
Thu Jul 13 08:36:02 EDT 2017
Hi All,
I am running N-body + Hydro simulations of cluster mergers with FLASH.
My code, depending on the initial parameters used (such as the refinement or placement of the clusters) or the number of nodes I run over, either fails straight away during initialisation with a generic segmentation fault, or runs for a while and then "hangs", i.e. stops doing anything code-wise although the job still sits on our HPC using up memory.
If the job hangs, I can usually restart from the previous checkpoint and it will run further along until it hangs again, so it's a case of restarting the simulation every time it hangs until it completes its run. That isn't ideal, but at least I can get some work out of it.
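For reference, the restarts are done through the usual FLASH runtime parameters, roughly along these lines (a sketch from memory of the FLASH4 manual; the file numbers are only examples):

    # flash.par -- restart from the last good checkpoint (sketch, example numbers)
    restart              = .true.
    checkpointFileNumber = 46    # index of the checkpoint file to restart from
    plotFileNumber       = 460   # so new plot files continue the old numbering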
For example, if I run a hydrostatic test for a single galaxy cluster on 20 nodes it fails straight away, but if I increase the number of nodes to 100 the simulation runs all the way to completion. I have now done two hydrostatic tests on a single cluster and both ran without any issues (I run for 10 Gyr and produce 1000 plot files). I have also done a cluster merger simulation which, on the first attempt, reached 460 plot files before hanging, while on the second attempt (starting from the beginning) it got all the way to 862 plot files before hanging. Both simulations use the same flash4 binary.
Block-wise, the simulations should be able to run on, say, 10 or 20 nodes, but they only run when I use 50 or more nodes, at which point each processor holds only 2-6 blocks. Interestingly, I have been compiling with debugging flags and -O0 instead of the usual -O3, and the simulations appear to run faster and get further along this way.
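In case the build details matter, the debug builds are produced with FLASH's setup script roughly as below (a sketch; the simulation name and dimensionality are placeholders, and the actual -O0/-O3 flags come from the FFLAGS_DEBUG/FFLAGS_OPT entries in my Makefile.h):

    # optimised build (uses FFLAGS_OPT from Makefile.h, -O3 here)
    ./setup MySimulation -3d -auto -opt
    # debug build (uses FFLAGS_DEBUG from Makefile.h, -g -O0 here)
    ./setup MySimulation -3d -auto -debug
    cd object && make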
I have used the Intel Inspector debugging tool on a 20-node run (where it fails straight away), and it reported 19 memory leaks and 1 memory deallocation issue:
[Attached: screenshot of the Intel Inspector results, Screen Shot 2017-07-13 at 10.57.23.png (archived at http://flash.rochester.edu/pipermail/flash-users/attachments/20170713/c14f614b/attachment.png)]
The memory deallocation issue is only a warning and comes from the flash4 binary, where Inspector complains about allocations in files such as Grid_init.F90, Particles_init.F90 and amr_initialize.F90. These are obviously files I don't touch, and I would presume FLASH handles those allocations correctly anyway.
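For completeness, the Inspector collection was run from the command line under mpirun, roughly along these lines (a sketch; the rank count and result directory name are only examples):

    # memory-error analysis (mi3 = locate memory leaks and memory problems) on a small MPI run
    mpirun -np 4 inspxe-cl -collect mi3 -result-dir insp_results -- ./flash4
    # summarise the findings afterwards
    inspxe-cl -report problems -result-dir insp_results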
I don't know much about the memory leak issues involving libmpi and related libraries, except that they come from the MPI package. I have mentioned this to our HPC engineers and they said: "if these libraries have something wrong it should affect everyone using the same version of MPI on Viper, which is not the case. In some very unlikely cases, there might be a bug or an issue with one of these libraries (that is more probable with omnipath)".
I have been through my code many times, both by myself and with colleagues, and we just can't see any issue there. I am using standard FLASH modules, so nothing has been altered in that regard.
Has anyone got an idea what is happening? Has anyone come across memory leak issues like these with FLASH before? Is this a problem with the MPI installation on the HPC, or is it more likely there is a problem with my code somewhere?
I am currently using intel/mpi/64/5.1.3.181 and hdf5/intel/intelmpi/1.8.16 to compile and run my jobs. I have also tried hdf5/gcc/openmpi/1.8.16 with openmpi/gcc/1.10.5, but I see similar behaviour.
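In other words, the environment is loaded roughly like this before building (module names exactly as above; the commented line is the alternative stack I tried):

    # Intel MPI + Intel-built HDF5 (the combination described above)
    module load intel/mpi/64/5.1.3.181 hdf5/intel/intelmpi/1.8.16
    # alternative stack that shows similar behaviour
    # module load openmpi/gcc/1.10.5 hdf5/gcc/openmpi/1.8.16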
Any thoughts on this will be greatly appreciated!
Many Thanks,
Alex
________________________________
Mr Alex Sheardown
Postgraduate Research Student
E.A. Milne Centre for Astrophysics
University of Hull
Cottingham Road
Kingston upon Hull
HU6 7RX
www.milne.hull.ac.uk