<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" dir="ltr" style="font-size: 12pt; color: rgb(0, 0, 0); font-family: Calibri, Helvetica, sans-serif, EmojiFont, "Apple Color Emoji", "Segoe UI Emoji", NotoColorEmoji, "Segoe UI Symbol", "Android Emoji", EmojiSymbols;">
<p>Dear FLASH developers,</p>
<p>I finally found out why FLASH simulations systematically hang on our Lustre filesystem when running on AMD EPYC 7k compute nodes (128 physical cores).</p>
<p><br>
</p>
<p></p>
<div>The problem lies in concurrent reading on Lustre. </div>
<div><br>
</div>
<div>When FLASH starts its initialization, all processes need to read the same input data files (the .cn4 files etc.) and to allocate</div>
<div>parameters, data arrays etc. in memory.</div>
<div>This reading is of course asynchronous (as it should be with MPI), but in principle it should not need any complex synchronization or distributed locking mechanism,</div>
<div>which should only be relevant when writing.</div>
<div>In practice, when too many processes on one client try to read the same .cn4 input file concurrently, all of them hang, and the Lustre directory becomes corrupted and is no longer accessible.</div>
<div><span style="font-size:12pt">Moving all the needed input files to /tmp (strict POSIX) on each node (they are copied once per node) and adapting the </span><span style="font-size:12pt"> flash.par file to read from /tmp and NOT from /lustre</span></div>
<div> solved the issue. </div>
<div><span style="font-size:12pt">With this approach, FLASH simulation jobs now run stably and can use all cores per node, even on the AMD EPYC 7k architecture.</span><br>
</div>
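<div>For reference, the staging step described above can be sketched roughly as follows. This is only a minimal Python sketch under assumptions: the function names stage_once_per_node and rewrite_par and the marker file names are hypothetical, and it assumes that all input paths in flash.par share one Lustre directory prefix, so that a plain prefix substitution is enough.</div>
<pre>
```python
import os
import shutil
import time
from pathlib import Path


def stage_once_per_node(sources, dest_dir, timeout=600.0):
    """Copy `sources` into `dest_dir`; only one process per node copies."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    lock = dest / ".staging.lock"  # hypothetical marker names
    done = dest / ".staging.done"
    try:
        # Creating a directory is atomic on a local POSIX filesystem,
        # so exactly one process on the node wins and does the copy.
        os.mkdir(lock)
        i_am_copier = True
    except FileExistsError:
        i_am_copier = False

    if i_am_copier:
        for src in sources:
            shutil.copy2(src, dest / Path(src).name)
        done.touch()  # tell the other processes the files are complete
    else:
        deadline = time.monotonic() + timeout
        while not done.exists():
            if time.monotonic() > deadline:
                raise TimeoutError("staging did not finish in time")
            time.sleep(0.1)
    return [dest / Path(src).name for src in sources]


def rewrite_par(par_text, lustre_dir, local_dir):
    """Replace the Lustre directory prefix by the node-local one."""
    old = str(lustre_dir).rstrip("/") + "/"
    new = str(local_dir).rstrip("/") + "/"
    return par_text.replace(old, new)
```
</pre>
<div>Every task on a node could call stage_once_per_node with the list of input files and a directory under /tmp before FLASH is started; markers left over from an earlier job in the same directory would make the copy be skipped, so a fresh directory per job is the safer choice.</div>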
<div>Writing shows no problem, since in that case there is no scaling issue (collective MPI-IO is used).</div>
<div>The FLASH I/O writing capability has been tested with an adapted version of the official FLASH I/O benchmark,</div>
<div>modified so that it runs with the latest gfortran, MPI and HDF5 libraries.</div>
<div><br>
</div>
<div>For those interested, the modified version can be freely downloaded here:</div>
<div><a href="https://git.gsi.de/d.bertini/pp-flash/-/tree/main/flash_io?ref_type=heads" class="OWAAutoLink">https://git.gsi.de/d.bertini/pp-flash/-/tree/main/flash_io?ref_type=heads</a><br>
</div>
<div><br>
</div>
<div>The FLASH I/O writing benchmark is stable and shows good results on /lustre.</div>
<br>
<p></p>
<p>As this seems to be related to a Lustre bug (client or server side?), it would be nice to create a small MPI program that just reads these .cn4 files to</p>
<p>reproduce the problem.</p>
<p>Could you tell me which routines in FLASH read the .cn4 files during initialization?</p>
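<div>As a starting point, the access pattern (many processes on one client reading the same file at the same moment) can be imitated without MPI. The sketch below is a stand-in, not the MPI reproducer itself and not what FLASH does internally: it uses Python's multiprocessing with the "fork" start method (Linux only), a barrier in place of MPI_Barrier, and the hypothetical function name concurrent_read.</div>
<pre>
```python
import hashlib
import multiprocessing as mp


def _reader(path, barrier, results):
    """Wait for all readers, then read `path` fully and report its MD5."""
    barrier.wait()  # release all readers at the same moment
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    results.put(digest.hexdigest())


def concurrent_read(path, nprocs, timeout=120.0):
    """Let `nprocs` local processes read `path` at once; return digests."""
    ctx = mp.get_context("fork")  # assumption: running on Linux
    barrier = ctx.Barrier(nprocs)
    results = ctx.Queue()
    workers = [
        ctx.Process(target=_reader, args=(path, barrier, results))
        for _ in range(nprocs)
    ]
    for worker in workers:
        worker.start()
    try:
        # Each get waits at most `timeout` seconds; a reader that never
        # answers surfaces as queue.Empty instead of blocking forever.
        digests = [results.get(timeout=timeout) for _ in workers]
    finally:
        for worker in workers:
            worker.join(timeout=1.0)
            if worker.is_alive():
                worker.terminate()
    return digests
```
</pre>
<div>Calling concurrent_read with the path of one .cn4 file on /lustre and 128 processes would mimic a fully loaded node; identical digests from all readers mean the reads completed consistently, while a queue.Empty exception would point to a hanging reader.</div>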
<p><br>
</p>
<p>Thanks in advance </p>
<p><br>
</p>
<div id="Signature">
<div id="divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:rgb(0,0,0); font-family:Calibri,Helvetica,sans-serif,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols">
<div><span style="font-size:9pt">---------</span><span style="font-size:9pt"></span></div>
<div><span style="font-size:9pt">Denis Bertini</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Abteilung: CIT</span><br>
<span style="font-size:9pt"></span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Ort: SB3 2.265a</span></div>
<span style="font-size:9pt"></span>
<div><br>
<span style="font-size:9pt"></span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Tel: +49 6159 71 2240</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Fax: +49 6159 71 2986</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">E-Mail: d.bertini@gsi.de</span></div>
<span style="font-size:9pt"></span>
<div><br>
<span style="font-size:9pt"></span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">GSI Helmholtzzentrum für Schwerionenforschung GmbH</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de</span></div>
<span style="font-size:9pt"></span>
<div><br>
<span style="font-size:9pt"></span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Managing Directors / Geschäftsführung:</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:</span></div>
<span style="font-size:9pt"></span>
<div><span style="font-size:9pt">Ministerialdirigent Dr. Volkmar Dietz</span></div>
<p></p>
</div>
</div>
</div>
</body>
</html>