[FLASH-USERS] Flash not progressing after 'Initial dt verified'

Lee Ellison lee at pacificfusion.com
Tue Oct 15 11:03:11 EDT 2024


Hi Gabriel,

I've seen stalls like this when my Makefile pointed to a different MPI
library than the one I was using to run the program. I'll echo Adam Reyes'
suggestion to carefully check which MPI versions you're using throughout
the build (especially if your "site" changed when you switched OS) and at
execution.
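
A rough sketch of that check in shell, in case it is useful. It assumes the
executable is called flash4 and is dynamically linked, that ldd is available,
and that the launcher is mpirun. The helper names mpi_lib_dirs and
launcher_prefix are made up for illustration and are not part of FLASH.

```shell
# Hypothetical helpers (not part of FLASH) for comparing the MPI library a
# binary is linked against with the mpirun found on PATH.

# Read `ldd` output on stdin and print the directory of every resolved
# shared library whose name contains "mpi"
# (lines of the form "libmpi.so.40 => /path/to/libmpi.so.40 (0x...)").
mpi_lib_dirs() {
    awk '$2 == "=>" && $1 ~ /mpi/ && $3 ~ /^\// {
             sub("/[^/]*$", "", $3)   # strip the file name, keep the directory
             print $3
         }' | sort -u
}

# Print the installation prefix of a launcher path,
# e.g. /opt/openmpi/bin/mpirun -> /opt/openmpi
launcher_prefix() {
    dirname "$(dirname "$1")"
}

# Usage, from the object directory:
#   ldd ./flash4 | mpi_lib_dirs               # MPI the binary was linked against
#   launcher_prefix "$(command -v mpirun)"    # MPI that will launch it
#   mpif90 --showme                           # Open MPI wrapper: underlying compile line
#   grep -n MPI_PATH Makefile.h               # assumes your site Makefile.h sets MPI_PATH
```

If the directory printed by the first command does not sit under the prefix
printed by the second, the build and the launcher are probably using
different MPI installations.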

Good luck!
Leland

________________
Leland Ellison PhD
Pacific Fusion
Lead - Modeling and Simulations

On Tue, Oct 15, 2024 at 5:01 AM Ryan Farber <rjfarber at umich.edu> wrote:

> Hi Gabriel,
>
> Unfortunately, I haven't had a "stalling" FLASH issue in quite some time.
> From what I recall (aligned with what Adam mentioned), I found my issue
> occurred during MPI communication. I think my fix was to switch MPI
> packages. So, you might want to try using a newer version of openmpi (in
> case openSUSE had a bug in 4.0.5 which got patched) or switch to mpich etc.
> If you really want to trace where the stall occurs, you can try using a
> parallel debugging tool to step through your code (or manually by adding
> write statements).
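
If no parallel debugger is at hand, a cruder option is to attach gdb to each
stalled rank and print a backtrace. The sketch below assumes pgrep and gdb
are installed, that the executable is named flash4, and that you are allowed
to attach to your own processes. The function name dump_backtraces is
hypothetical.

```shell
# Hypothetical helper: print a backtrace for every running process whose
# name is exactly $1 (default: flash4).
dump_backtraces() {
    name="${1:-flash4}"
    for pid in $(pgrep -x "$name"); do
        echo "=== $name pid $pid ==="
        # -batch makes gdb run the given command, detach and exit
        gdb -p "$pid" -batch -ex "thread apply all bt" 2>/dev/null
    done
}

# Usage, while the run is stalled:
#   dump_backtraces flash4 > backtraces.txt
```

Backtraces that all end inside MPI calls would be consistent with the
communication stall suspected above. Building with -debug should make the
frames more readable.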
>
> Best wishes,
> --------
> Ryan
>
>
> On Tue, Oct 15, 2024 at 4:37 AM Gabriel Pérez Callejo <
> gabriel.perez.callejo at uva.es> wrote:
>
>> Dear both,
>>
>> Thanks for the help. I have added the DEBUG line to the setup and the
>> "print" line to both Timers_start and Timers_stop. I am attaching the new
>> log, stdout and stderr files (the STDOUT file is now significantly larger).
>> I am still quite unsure what is making this stall... The stalling only
>> happens for parallel runs indeed.
>>
>> Thanks again,
>> *Gabriel Pérez-Callejo*
>> Profesor Ayudante Doctor (Assistant Professor)
>> Departamento de Física Teórica, Atómica y Óptica
>> Universidad de Valladolid
>> Valladolid, Spain
>> +34 983 18 6513
>>
>>
>> On 15/10/24 at 12:37, Reyes, Adam wrote:
>>
>> Hi Gabriel & Ryan,
>>
>> Does the stalling happen only for parallel runs (mpirun -np > 1)? It
>> seems likely that it could be stalling during some mpi communication. In my
>> experience making sure all the dependencies are built consistently with the
>> same compilers & mpi has helped.
>>
>> As for pinpointing where exactly the stall is happening there are a
>> couple of things you can try:
>>
>> * Setup with "-defines=DEBUG_ALL"; this will turn on a lot of debugging
>> messages in all the FLASH units.
>>
>> * You can add a line like
>>
>> print *, "timers start", name
>>
>> to Timers_start.F90 and Timers_stop.F90.
>>
>> Both of these should print plenty of messages to help narrow down where
>> exactly the code is stalling.
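
Once the extra messages are in place, the tail of the captured output is the
part that matters. A small sketch for pulling out the last timer messages,
assuming stdout was redirected to a file and the print line uses the text
"timers start" as above. The function name last_timers is made up for
illustration.

```shell
# Hypothetical helper: show the last few timer messages in a captured
# stdout file, with their line numbers.
last_timers() {
    file="$1"
    count="${2:-5}"     # how many messages to show (default 5)
    grep -n "timers start" "$file" | tail -n "$count"
}

# Usage:
#   mpirun -np 3 ./flash4 > STDOUT.txt 2> STDERR
#   last_timers STDOUT.txt 10
```

With several ranks writing to the same file the messages interleave, so the
last few lines are a hint about where each rank stopped rather than an exact
answer.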
>>
>>
>>
>>
>> *********************************************
>> Adam Reyes
>> Code Group Leader, Flash Center for Computational Science
>> Research Scientist, Dept. of Physics and Astronomy
>> University of Rochester
>> River Campus: Bausch and Lomb Hall, 369
>> 500 Wilson Blvd. PO Box 270171, Rochester, NY 14627
>> Email adam.reyes at rochester.edu
>> Web https://flash.rochester.edu
>>  (he / him / his)
>>
>> *********************************************
>>
>>
>>
>> On Oct 15, 2024, at 12:18 PM, Gabriel Pérez Callejo
>> <gabriel.perez.callejo at uva.es> wrote:
>>
>> Hi Ryan,
>>
>> Thanks for the quick response. I am attaching to this email the STDOUT,
>> STDERR and log files.
>>
>> To answer your questions, the simulation does stall. The ps command shows
>> the parallel processes and mpirun as active, but no progress is made:
>> nothing is printed to the log, STDOUT or STDERR files, and if I run the
>> *top* command, the machine is not doing any work on FLASH.
>>
>> I have retried with +noio and -debug in my setup command, but the
>> behaviour is identical: same problem.
>>
>> Best,
>> *Gabriel Pérez-Callejo*
>> Profesor Ayudante Doctor (Assistant Professor)
>> Departamento de Física Teórica, Atómica y Óptica
>> Universidad de Valladolid
>> Valladolid, Spain
>> +34 983 18 6513
>>
>>
>> On 15/10/24 at 12:00, Ryan Farber wrote:
>>
>> Hi Gabriel,
>>
>> I have encountered (and am to some extent still trying to understand) a
>> similar, possibly the same, issue (also with FLASH 4.8). I think the usual
>> issue I encounter is caused by running out of memory, but it may also
>> be related to HDF5...
>>
>> Regarding your issue, does the run just stall? Such that ps aux | grep
>> flash shows the process is running but the simulation makes no progress in
>> outputting to your log file or STDOUT/STDERR file(s)?
>>
>> Or does the run die? [Some error is encountered / ps no longer shows the
>> process or it's in a completing, i/o, or zombie state.]
>>
>> It would be helpful if you can attach your log file and your
>> STDOUT/STDERR file(s). It would also be useful if you try using +noio to
>> determine if you have an HDF5 issue, and -debug to provide a traceback if
>> an exception is raised.
>>
>> It's interesting this happened for you just changing distributions. I'm
>> hoping you re-installed openmpi, hdf5, etc. on the new OS rather than
>> copying your installations from your old OS(?)
>>
>> Best wishes,
>> --------
>> Ryan
>>
>>
>> On Tue, Oct 15, 2024 at 2:51 AM Gabriel Pérez Callejo <
>> gabriel.perez.callejo at uva.es> wrote:
>>
>>> Dear all,
>>>
>>> I have been using FLASH for a while on Ubuntu 18, and am now moving to
>>> the Linux distribution openSUSE. However, when running FLASH in
>>> parallel mode, I am encountering the following problem.
>>>
>>> I am testing the LaserSlab example, with FLASH4.6.2, using hdf5-1.10.7,
>>> hypre-2.11.2 and openmpi-4.0.5 (same as I used in Ubuntu 18).
>>>
>>> I am launching the simulation by using *"./setup -auto LaserSlab -2d
>>> +cylindrical -nxb=16 -nyb=16 +hdf5typeio species=cham,targ +mtmmmt
>>> +laser +uhd3t +mgd mgd_meshgroups=6 -parfile=example.par"*, then moving
>>> to the *object* directory, using *"make -j"* and, after SUCCESS, running
>>> *"mpirun -np 3 flash4"*.
>>>
>>> Now, this is what I used to do in Ubuntu, but what I am finding in this
>>> case is that the calculation is initialized, but after printing *"Initial
>>> dt verified" *nothing else happens. The code does not move forward. I
>>> can see that the chk_0000 file has been generated, but not the plt_0000.
>>>
>>> Has anyone encountered this problem before? Does anyone have any
>>> suggestions on how to fix it?
>>>
>>> Best,
>>> --
>>> *Gabriel Pérez-Callejo*
>>> Profesor Ayudante Doctor (Assistant Professor)
>>> Departamento de Física Teórica, Atómica y Óptica
>>> Universidad de Valladolid
>>> Valladolid, Spain
>>> +34 983 18 6513
>>>
>>>
>>> _______________________________________________
>>> flash-users mailing list
>>> flash-users at flash.rochester.edu
>>>
>>> For list info, including unsubscribe:
>>> https://flash.rochester.edu/mailman/listinfo/flash-users
>>>
>> <lasslab.log><STDERR><STDOUT.txt>