[FLASH-USERS] Flash not progressing after 'Initial dt verified'
Reyes, Adam
adam.reyes at rochester.edu
Tue Oct 15 06:37:00 EDT 2024
Hi Gabriel & Ryan,
Does the stalling happen only for parallel runs (mpirun -np > 1)? It seems likely that it is stalling during some MPI communication. In my experience, making sure all the dependencies are built consistently with the same compilers and MPI has helped.
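One quick way to check this, as a minimal sketch: the function below reports where each toolchain command resolves on your PATH, so a stale or mixed install shows up as an unexpected path. The tool names (mpirun, mpif90, mpicc, h5pfc) are assumptions about a typical Open MPI plus parallel HDF5 setup; substitute whatever your site Makefile.h actually uses.

```shell
#!/bin/sh
# report_tool NAME: print where NAME resolves on PATH, or note that it is missing.
report_tool() {
    if path=$(command -v "$1" 2>/dev/null); then
        printf '%-8s -> %s\n' "$1" "$path"
    else
        printf '%-8s -> not found on PATH\n' "$1"
    fi
}

# Assumed tool names; adjust to match your site Makefile.h.
for tool in mpirun mpif90 mpicc h5pfc; do
    report_tool "$tool"
done
```

If the MPI wrappers and the HDF5 wrapper resolve under different install prefixes than you expect, that is worth sorting out before debugging further.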
As for pinpointing where exactly the stall is happening there are a couple of things you can try:
* Setup with "-defines=DEBUG_ALL"; this will turn on a lot of debugging messages in all the FLASH units.
* You can add a line like
> print *, "timers start", name
to "Timers_start.F90" and "Timers_stop.F90".
Both of these should print plenty of messages to help narrow down where exactly the code is stalling.
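If it helps, here is a small sketch for pulling the last of those messages out of a captured run. It assumes you redirected the run's output to a file (the name run.out is only a placeholder) and that you kept the "timers start" text from the print line above.

```shell
#!/bin/sh
# last_timer FILE: print the last "timers start" line in FILE,
# i.e. the last timer that was started before the output stopped.
last_timer() {
    grep 'timers start' "$1" | tail -n 1
}

# Example (run.out is a placeholder name):
#   mpirun -np 3 ./flash4 > run.out 2>&1
#   last_timer run.out
```

With several ranks the output interleaves, so it may be worth looking at the last few matching lines rather than only the final one.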
*********************************************
Adam Reyes

Code Group Leader, Flash Center for Computational Science
Research Scientist, Dept. of Physics and Astronomy
University of Rochester
River Campus: Bausch and Lomb Hall, 369
500 Wilson Blvd. PO Box 270171, Rochester, NY 14627
Email adam.reyes at rochester.edu
Web https://flash.rochester.edu
(he / him / his)

*********************************************
> On Oct 15, 2024, at 12:18 PM, Gabriel Pérez Callejo <gabriel.perez.callejo at uva.es> wrote:
>
> Hi Ryan,
>
> Thanks for the quick response. I am attaching to this email the STDOUT, STDERR and log files.
>
> To answer your questions, the simulation does stall. The ps command shows the parallel processes as active, as well as the mpirun, but no progress is made, nothing is printed to the log, STDOUT or STDERR files, and if I run a *top* command, the machine is not working on FLASH.
>
> I have retried including +noio and -debug in my setup command, but it works identically, same problem.
>
> Best,
>
> Gabriel Pérez-Callejo
> Profesor Ayudante Doctor (Assistant Professor)
> Departamento de Física Teórica, Atómica y Óptica
> Universidad de Valladolid
> Valladolid, Spain
> +34 983 18 6513
>
>
> On 15/10/24 at 12:00, Ryan Farber wrote:
>> Hi Gabriel,
>>
>> I have encountered (and am to some extent still trying to understand) a similar, possibly the same, issue (also with FLASH 4.8). I think the usual issue I encounter is caused by running out of memory, but it may also be related to HDF5...
>>
>> Regarding your issue, does the run just stall? Such that ps aux | grep flash shows the process is running but the simulation makes no progress in outputting to your log file or STDOUT/STDERR file(s)?
>>
>> Or does the run die? [Some error is encountered / ps no longer shows the process or it's in a completing, i/o, or zombie state.]
>>
>> It would be helpful if you can attach your log file and your STDOUT/STDERR file(s). It would also be useful if you try using +noio to determine if you have an HDF5 issue, and -debug to provide a traceback if an exception is raised.
>>
>> It's interesting this happened for you just changing distributions. I'm hoping you re-installed openmpi, hdf5, etc. on the new OS rather than copying your installations from your old OS(?)
>>
>> Best wishes,
>> --------
>> Ryan
>>
>>
>> On Tue, Oct 15, 2024 at 2:51 AM Gabriel Pérez Callejo <gabriel.perez.callejo at uva.es> wrote:
>>> Dear all,
>>>
>>> I have been using FLASH for a while on Ubuntu 18, and am now moving to the Linux distribution OpenSUSE. However, when running FLASH in parallel mode, I am encountering the following problem.
>>>
>>> I am testing the LaserSlab example, with FLASH4.6.2, using hdf5-1.10.7, hypre-2.11.2 and openmpi-4.0.5 (same as I used in Ubuntu 18).
>>>
>>> I am launching the simulation by using "./setup -auto LaserSlab -2d +cylindrical -nxb=16 -nyb=16 +hdf5typeio species=cham,targ +mtmmmt +laser +uhd3t +mgd mgd_meshgroups=6 -parfile=example.par" then moving to the object directory, using "make -j" and after SUCCESS running "mpirun -np 3 flash4".
>>>
>>> Now, this is what I used to do in Ubuntu, but what I am finding in this case is that the calculation is initialized, but after printing "Initial dt verified" nothing else happens. The code does not move forward. I can see that the chk_0000 file has been generated, but not the plt_0000.
>>>
>>> Has anyone encountered this problem before? Does anyone have any suggestions on how to fix it?
>>>
>>> Best,
>>>
>>> --
>>> Gabriel Pérez-Callejo
>>> Profesor Ayudante Doctor (Assistant Professor)
>>> Departamento de Física Teórica, Atómica y Óptica
>>> Universidad de Valladolid
>>> Valladolid, Spain
>>> +34 983 18 6513
>>>
>>>
>>> _______________________________________________
>>> flash-users mailing list
>>> flash-users at flash.rochester.edu
>>>
>>> For list info, including unsubscribe:
>>> https://flash.rochester.edu/mailman/listinfo/flash-users
> <lasslab.log><STDERR><STDOUT.txt>