[FLASH-USERS] FLASH initialisation hanging on Lustre filesystem

Bertini, Denis Dr. D.Bertini at gsi.de
Wed Jun 19 03:51:52 EDT 2024


Dear FLASH developers,

I finally found out why FLASH simulations systematically hang on our Lustre filesystem when using AMD EPYC 7k compute nodes (128 physical cores).


The problem lies in concurrent reads on Lustre.

When FLASH starts its initialization, all processes need to read the same input data files (the .cn4 files etc.) and need to allocate
parameters, data arrays etc. in memory.
These reads are of course asynchronous (as they should be with MPI), but in principle they should not need complex synchronization or a distributed locking mechanism,
which should only be relevant for writes.
In fact, when too many processes on one client try to read the same .cn4 input file concurrently, all the processes hang and the Lustre directory becomes corrupted and is no longer accessible.
Moving all the needed input files to /tmp (strict POSIX) on each node (they are copied once per node) and adapting the flash.par file to read from /tmp and NOT from /lustre
solved the issue.
With this approach, FLASH simulation jobs now run stably and can use all cores per node, even on the AMD EPYC 7k architecture.
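
For illustration, the per-node staging step could look roughly like the Python sketch below. This is not the script used on our cluster: the function name stage_inputs and the example paths in the comments are hypothetical, and it assumes the helper is called once per node (for instance from the job script) before FLASH is launched.

```python
import os
import shutil
import tempfile


def stage_inputs(src_files, dest_dir):
    """Copy each file in src_files into dest_dir unless it is already there.

    Returns the staged paths, in the same order as src_files.
    """
    os.makedirs(dest_dir, exist_ok=True)
    staged = []
    for src in src_files:
        dest = os.path.join(dest_dir, os.path.basename(src))
        if not os.path.exists(dest):
            # Copy to a temporary name first, then rename atomically, so that
            # a concurrent reader never sees a partially written file.
            fd, tmp = tempfile.mkstemp(dir=dest_dir)
            os.close(fd)
            shutil.copyfile(src, tmp)
            os.replace(tmp, dest)
        staged.append(dest)
    return staged


# Hypothetical usage (paths are placeholders, not the real ones):
#   stage_inputs(["/lustre/myrun/table.cn4"], "/tmp/flash_inputs")
# flash.par must then point to the files under /tmp/flash_inputs.
```

The temporary-file-plus-rename step is only a precaution in case a reader on the same node opens the file while the copy is still in progress.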
Writing shows no problem, since in this case there is no scaling issue (collective MPI-IO is used).
The FLASH I/O writing capability has been tested with an adapted version of the official FLASH I/O benchmark,
modified so that it runs with the latest gfortran, MPI and HDF5 libraries.

For those interested, the modified version can be freely downloaded here:
https://git.gsi.de/d.bertini/pp-flash/-/tree/main/flash_io?ref_type=heads

The FLASH I/O writing benchmark is stable and shows good results on /lustre.


As this seems to be related to a Lustre bug (client or server side?), it would be nice to create a small MPI program that just reads these .cn4 files to
reproduce this problem.
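
As a starting point before writing the MPI version, the access pattern (many processes on one client reading the same file at the same time) can be approximated with plain Python subprocesses. The sketch below is only a stand-in: it does not use MPI, the name concurrent_read is hypothetical, and I have not confirmed that it triggers the hang. It just starts N independent readers on one node and reports which ones finished within a timeout.

```python
import subprocess
import sys

# Body of each child process: read the whole file, print the byte count.
READER = "import sys; print(len(open(sys.argv[1], 'rb').read()))"


def concurrent_read(path, n_readers, timeout_s=60.0):
    """Start n_readers processes that all read `path` at the same time.

    Returns one entry per reader: the number of bytes it read, or None
    if the reader failed or did not finish within timeout_s.
    """
    procs = [
        subprocess.Popen([sys.executable, "-c", READER, path],
                         stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
        for _ in range(n_readers)
    ]
    results = []
    for p in procs:
        try:
            out, _ = p.communicate(timeout=timeout_s)
            ok = p.returncode == 0
            results.append(int(out.decode().strip()) if ok else None)
        except subprocess.TimeoutExpired:
            # Do not wait after kill(): a reader stuck in filesystem I/O
            # may not terminate, and waiting would block this script too.
            p.kill()
            results.append(None)
    return results


# Hypothetical usage: 128 readers, one per core, on a placeholder path.
#   print(concurrent_read("/lustre/myrun/table.cn4", 128))
```

Any None in the returned list marks a reader that failed or timed out, which would be the signature to look for on an affected node.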

Could you tell me which routines in FLASH read the .cn4 files during initialisation?


Thanks in advance


---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bertini at gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz