[FLASH-USERS] FLASH initialisation hanging on Lustre filesystem
Bertini, Denis Dr.
D.Bertini at gsi.de
Wed Jun 19 03:51:52 EDT 2024
Dear FLASH developers,
I have finally found out why FLASH simulations systematically hang on our Lustre filesystem when running on AMD EPYC 7k compute nodes (128 physical cores).
The problem lies in concurrent reading on Lustre.
When FLASH starts its initialization, all processes need to read the same input data files (the .cn4 files etc.) and allocate parameters, data arrays, etc. in memory.
These reads are of course asynchronous (as they should be with MPI), but in principle they should not need complex synchronization or a distributed lock mechanism,
which should only be relevant when writing.
In fact, when too many processes on one client try to read the same .cn4 input file concurrently, all the processes hang and the Lustre directory becomes corrupted and is no longer accessible.
Moving all the needed input files to /tmp (strict POSIX) on each node (they are copied once per node) and adapting the flash.par file to read from /tmp and NOT from /lustre
solved the issue.
With this approach, FLASH simulation jobs now run stably and can use all cores per node, even on the AMD EPYC 7k architecture.
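For illustration, the "copied once per node" step can be sketched as below. This is only a minimal sketch under stated assumptions, not the script actually used for these jobs: the function name `stage_once`, the marker file names and the default timeout are all hypothetical, and it assumes the destination directory is on a node-local POSIX filesystem such as /tmp, where exclusive file creation is atomic.

```python
import os
import shutil
import time


def stage_once(src_files, dst_dir, timeout=300.0):
    """Copy src_files into dst_dir once per node; return the node-local paths.

    Hypothetical helper: the names and the timeout are illustrative assumptions.
    """
    os.makedirs(dst_dir, exist_ok=True)
    lock = os.path.join(dst_dir, ".staging.lock")
    done = os.path.join(dst_dir, ".staging.done")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so exactly one
        # process on the node wins and performs the copy.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another process on this node is (or was) copying: wait for its marker.
        deadline = time.monotonic() + timeout
        while not os.path.exists(done):
            if time.monotonic() > deadline:
                raise TimeoutError("staging did not finish in time")
            time.sleep(0.1)
    else:
        os.close(fd)
        for src in src_files:
            shutil.copy(src, dst_dir)
        # The marker is created only after every file has been copied.
        open(done, "w").close()
    return [os.path.join(dst_dir, os.path.basename(s)) for s in src_files]
```

Every process on a node could call this before FLASH starts, with the paths in flash.par then pointing at the returned node-local copies. A lock file left behind by an earlier job would have to be removed first; the sketch does not handle that case.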
Writing shows no problem, since in this case there is no scaling issue (collective MPI-IO is used).
The FLASH I/O writing capability has been tested with an adapted version of the official FLASH I/O benchmark,
modified so that it can run with the latest gfortran, MPI and HDF5 libraries.
For those interested, the modified version can be freely downloaded here:
https://git.gsi.de/d.bertini/pp-flash/-/tree/main/flash_io?ref_type=heads
The FLASH I/O writing benchmark is stable and shows good results on /lustre.
As this seems to be related to a Lustre bug (client or server side?), it would be nice to create a small MPI program that just reads these .cn4 files in order to
reproduce the problem.
Could you tell me which routines in FLASH read the .cn4 files during initialisation?
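Until the actual FLASH read routines are identified, the access pattern itself (many processes on one client reading the same file at the same time) can be approximated without MPI. The sketch below is a stand-in, not the MPI reproducer asked for above: it launches independent Python processes with `subprocess` instead of MPI ranks, and it assumes plain sequential reads in 1 MiB chunks, which may differ from what FLASH really does. The names `READER` and `concurrent_read` are hypothetical.

```python
import subprocess
import sys

# Child program: read the whole file in 1 MiB chunks and print the byte count.
READER = """
import sys
total = 0
with open(sys.argv[1], "rb") as f:
    while True:
        chunk = f.read(1 << 20)
        if not chunk:
            break
        total += len(chunk)
print(total)
"""


def concurrent_read(path, nprocs, timeout=60):
    """Start nprocs independent readers of path; return the bytes each one read."""
    # All readers are started before any of them is waited on, so they overlap.
    procs = [
        subprocess.Popen([sys.executable, "-c", READER, path],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(nprocs)
    ]
    counts = []
    for p in procs:
        try:
            # A reader exceeding the timeout would point to the hang described above.
            out, _ = p.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            p.kill()
            raise
        counts.append(int(out))
    return counts
```

Running this on a single node with `nprocs` close to the core count, once against a file on /lustre and once against a copy in /tmp, would show whether the hang depends on the filesystem alone or on something specific to how FLASH reads the file.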
Thanks in advance
---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bertini at gsi.de
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz