[FLASH-USERS] Flash not progressing after 'Initial dt verified'
Ryan Farber
rjfarber at umich.edu
Tue Oct 15 07:59:56 EDT 2024
Hi Gabriel,
Unfortunately, I haven't had a "stalling" FLASH issue in quite some time.
From what I recall (aligned with what Adam mentioned), I found my issue
occurred during MPI communication. I think my fix was to switch MPI
packages. So, you might want to try using a newer version of openmpi (in
case openSUSE had a bug in 4.0.5 which got patched) or switch to mpich etc.
If you really want to trace where the stall occurs, you can try using a
parallel debugging tool to step through your code (or manually by adding
write statements).
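
One low-effort way to see where each rank is stuck, before stepping through the code, is to attach gdb to the hung processes and dump their stack traces. The following is only a sketch: it assumes gdb is installed, that the executable is named flash4 as in this thread, and that you are allowed to attach to your own processes. The function names are made up for illustration.

```shell
# Sketch: write one stack-trace file per running rank of a stalled job.
# Assumes gdb is available and the ranks are visible to pgrep on this node.

# Name of the trace file for a given PID (hypothetical naming scheme).
trace_file() {
    echo "bt_$1.txt"
}

# Attach to every process whose name is exactly $1 (default: flash4),
# print the backtrace of all threads, then detach again.
dump_backtraces() {
    exe=${1:-flash4}
    for pid in $(pgrep -x "$exe"); do
        gdb -p "$pid" -batch -ex "thread apply all bt" > "$(trace_file "$pid")" 2>&1
    done
}
```

Running `dump_backtraces flash4` while the job is hung should leave one bt_<pid>.txt per rank. If every rank sits inside an MPI call, that would support the MPI communication theory; building with -debug should make the frames easier to read.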
Best wishes,
On Tue, Oct 15, 2024 at 4:37 AM Gabriel Pérez Callejo <
gabriel.perez.callejo at uva.es> wrote:
> Dear both,
> Thanks for the help. I have added the DEBUG line to the setup and the
> "print" line to both Timers_start and Timers_stop. I am attaching the new
> log, stdout and stderr files (the STDOUT file is now significantly larger).
> I am still quite unsure what is making this stall... The stalling only
> happens for parallel runs indeed.
> Thanks again,
> *Gabriel Pérez-Callejo*
> Profesor Ayudante Doctor (Assistant Professor)
> Departamento de Física Teórica, Atómica y Óptica
> Universidad de Valladolid
> Valladolid, Spain
> +34 983 18 6513
> On 15/10/24 at 12:37, Reyes, Adam wrote:
> Hi Gabriel & Ryan,
> Does the stalling happen only for parallel runs (mpirun -np > 1)? It seems
> likely that it could be stalling during some mpi communication. In my
> experience making sure all the dependencies are built consistently with the
> same compilers & mpi has helped.
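
A quick consistency check is to look at which shared libraries the executable actually resolves to at run time. This is a sketch only: it assumes a dynamically linked ./flash4 and the standard ldd tool, and the function names are hypothetical. Statically linked libraries (hypre is often built that way) will not appear in the list.

```shell
# Sketch: list the MPI / HDF5 / hypre / Fortran runtime libraries an
# executable resolves to, to spot a mix of old and new installations.

# Keep only the library lines of interest from ldd-style text on stdin.
filter_libs() {
    grep -E 'libmpi|libhdf5|libHYPRE|libgfortran' || true
}

# Show the resolved library paths for one executable (default: ./flash4).
show_linked_libs() {
    ldd "${1:-./flash4}" | filter_libs
}
```

If `show_linked_libs ./flash4` points at a different MPI installation than the one `which mpirun` and `mpirun --version` report, the build and the launcher are probably not consistent.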
> As for pinpointing where exactly the stall is happening there are a couple
> of things you can try:
> * Set up with "-defines=DEBUG_ALL"; this will turn on a lot of debugging
> messages in all the FLASH units.
> * You can add a line like
> print *, "timers start", name
> to Timers_start.F90 and Timers_stop.F90.
> Both of these should print plenty of messages to help narrow down where
> exactly the code is stalling.
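
With that much output, it helps to pull out only the last timer messages rather than read the whole file. A minimal sketch, assuming stdout was redirected to a file and that the added print statement uses the literal text "timers start" as above; the function name is hypothetical.

```shell
# Sketch: show the last few "timers start" messages in a captured stdout file.
# $1 = file to search, $2 = how many matching lines to show (default: 1).
last_timers() {
    file=$1
    count=${2:-1}
    grep 'timers start' "$file" | tail -n "$count"
}
```

For example, after `mpirun -np 3 flash4 > STDOUT.txt`, calling `last_timers STDOUT.txt 3` prints the last three timers that were started. Because output may be buffered, the final message is a hint about where the stall is, not proof.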
> *********************************************
> Adam Reyes
> Code Group Leader, Flash Center for Computational Science
> Research Scientist, Dept. of Physics and Astronomy
> University of Rochester
> River Campus: Bausch and Lomb Hall, 369
> 500 Wilson Blvd. PO Box 270171, Rochester, NY 14627
> Email adam.reyes at rochester.edu
> Web https://flash.rochester.edu
> (he / him / his)
> *********************************************
> On Oct 15, 2024, at 12:18 PM, Gabriel Pérez Callejo
> <gabriel.perez.callejo at uva.es> wrote:
> Hi Ryan,
> Thanks for the quick response. I am attaching to this email the STDOUT,
> STDERR and log files.
> To answer your questions, the simulation does stall. The ps command shows
> the parallel processes as active, as well as the mpirun, but no progress is
> made, nothing is printed in the log, STDOUT or STDERR files, and if I run a
> *top* command, the machine is not working on FLASH.
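
To put numbers on this, the state and CPU share of each rank can be read from ps. The sketch below assumes the Linux procps version of ps and an executable named flash4; the function names are hypothetical and the descriptions are the usual meaning of the first STAT letter.

```shell
# Sketch: report state and CPU share for every rank of a running job.

# Translate the first letter of a ps STAT field into a plain description.
describe_state() {
    case "$1" in
        R*) echo "running (using CPU)" ;;
        S*) echo "sleeping (waiting for an event)" ;;
        D*) echo "uninterruptible wait (usually I/O)" ;;
        Z*) echo "zombie (exited, not yet reaped)" ;;
        T*) echo "stopped" ;;
        *)  echo "other" ;;
    esac
}

# Print PID, CPU percentage and state for each process named $1 (default: flash4).
report_ranks() {
    exe=${1:-flash4}
    ps -C "$exe" -o pid=,stat=,pcpu= | while read -r pid stat cpu; do
        echo "$pid $cpu% $(describe_state "$stat")"
    done
}
```

Running `report_ranks flash4` a few times in a row shows whether the ranks are all asleep at roughly zero CPU, or whether one of them is still busy while the others wait.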
> I have retried with +noio and -debug in my setup command, but it
> behaves identically; the problem is the same.
> Best,
> On 15/10/24 at 12:00, Ryan Farber wrote:
> Hi Gabriel,
> I have encountered (and am to some extent still trying to understand) a
> similar, possibly the same, issue (also with FLASH 4.8). I think the usual
> issue I encounter is caused by running out of memory, but it may also
> be related to HDF5...
> Regarding your issue, does the run just stall? Such that ps aux | grep
> flash shows the process is running but the simulation makes no progress in
> outputting to your log file or STDOUT/STDERR file(s)?
> Or does the run die? [Some error is encountered / ps no longer shows the
> process or it's in a completing, i/o, or zombie state.]
> It would be helpful if you can attach your log file and your STDOUT/STDERR
> file(s). It would also be useful if you try using +noio to determine if you
> have an HDF5 issue, and -debug to provide a traceback if an exception is
> raised.
> It's interesting this happened for you just changing distributions. I'm
> hoping you re-installed openmpi, hdf5, etc. on the new OS rather than
> copying your installations from your old OS(?)
> Best wishes,
> --------
> Ryan
> On Tue, Oct 15, 2024 at 2:51 AM Gabriel Pérez Callejo <
> gabriel.perez.callejo at uva.es> wrote:
>> Dear all,
>> I have been using FLASH for a while in Ubuntu 18, and am now moving to
>> the Linux distribution openSUSE. However, when running FLASH in
>> parallel mode, I am encountering the following problem.
>> I am testing the LaserSlab example, with FLASH4.6.2, using hdf5-1.10.7,
>> hypre-2.11.2 and openmpi-4.0.5 (same as I used in Ubuntu 18).
>> I am launching the simulation by using "./setup -auto LaserSlab -2d
>> +cylindrical -nxb=16 -nyb=16 +hdf5typeio species=cham,targ +mtmmmt +laser
>> +uhd3t +mgd mgd_meshgroups=6 -parfile=example.par", then moving to the
>> *object* directory, using "make -j" and, after SUCCESS, running
>> "mpirun -np 3 flash4".
>> Now, this is what I used to do in Ubuntu, but what I am finding in this
>> case is that the calculation is initialized, but after printing *"Initial
>> dt verified" *nothing else happens. The code does not move forward. I
>> can see that the chk_0000 file has been generated, but not the plt_0000.
>> Has anyone encountered this problem before? Does anyone have any
>> suggestions on how to fix it?
>> Best,
>> _______________________________________________
>> flash-users mailing list
>> flash-users at flash.rochester.edu
>> For list info, including unsubscribe:
>> https://flash.rochester.edu/mailman/listinfo/flash-users
> <lasslab.log><STDERR><STDOUT.txt>