[FLASH-USERS] Issue/Fix: Silent hangs in Paramesh
Lewis,Sean
scl63 at drexel.edu
Wed Aug 5 12:32:53 EDT 2020
Hello all,
An issue/bug has been discovered in the behavior of Paramesh in FLASH 4.5 and 4.6.1. Specifically, a run will hang, producing no error or abort, while in the subroutine mpi_xchange_blocks(…) located in /source/Grid/GridMain/paramesh/paramesh4/Paramesh4dev/PM4_package/mpi_source/mpi_lib.F90. The hang is semi-random, with bitwise-identical runs stalling at nearly, but not exactly, the same location. The stall occurs when one processor becomes stuck in one of the MPI_Ssend calls, seemingly unable to complete the data transfer, while the receiving processor waits indefinitely. I found that this occurs only in cross-node communication, so it appears to be due to a bug within OpenMPI (I am using version 3.1.1) and its use of InfiniBand communication methods.
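To make the failure mode concrete, below is a minimal sketch of the kind of synchronous send/receive exchange involved (illustrative only, not the actual Paramesh code; the message size, tag, and rank numbers are made up). MPI_Ssend does not return until the matching receive has been matched and the transfer is underway, so if the transport layer mishandles a large message both ranks simply wait on each other and no error is ever raised:

! Illustrative sketch only -- not the Paramesh source. Run on 2 or more ranks.
program ssend_sketch
  use mpi
  implicit none
  integer :: ierr, rank
  integer :: status(MPI_STATUS_SIZE)
  ! 2,000,000 doubles (~16 MB): large enough that OpenMPI would take its
  ! large-message (get/put) path rather than plain send/recv.
  integer, parameter :: n = 2000000
  real(kind=8), allocatable :: buf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(buf(n))

  if (rank == 0) then
     buf = 1.0d0
     ! Synchronous send: blocks until rank 1's receive has been matched.
     ! If the rendezvous is lost in the transport layer, this never returns.
     call MPI_Ssend(buf, n, MPI_DOUBLE_PRECISION, 1, 42, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     ! The receiver waits here indefinitely: no abort, no error message.
     call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 42, MPI_COMM_WORLD, status, ierr)
  end if

  deallocate(buf)
  call MPI_Finalize(ierr)
end program ssend_sketch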
Ultimately, the issue appears to stem from the get/put protocols that OpenMPI uses for large messages, and it can be resolved by forcing OpenMPI to communicate over InfiniBand using the send/recv protocols for ALL messages (normally used only for messages below roughly 1 MB). To do this, set the following MCA parameter in your mpirun/mpiexec call:
--mca btl_openib_flags 305
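For illustration, a full launch line might look like this (the process count and executable name are placeholders for whatever your job script normally uses, not part of the fix):

mpirun --mca btl_openib_flags 305 -np 128 ./flash4

OpenMPI will also pick the same parameter up from the environment, which can be convenient in batch scripts:

export OMPI_MCA_btl_openib_flags=305
mpirun -np 128 ./flash4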
I have not experienced the stalling/hang issue since setting this parameter.
Conversations of a similar nature have cropped up in the past, with suggested fixes such as downgrading gcc versions (http://flash.uchicago.edu/pipermail/flash-users/2015-February/001637.html) and increasing maxblocks (http://flash.uchicago.edu/pipermail/flash-users/2018-June/002652.html), but I have found this fix to be the most successful. The main drawback is that the send/recv protocols are less efficient than the get/put protocols, so a performance hit is inevitable; in my own runs it has been on the order of 10%. To be clear, neither I nor my colleagues have seen the stalling/hang issue with vanilla FLASH or in the standard test problems, but my application makes no modifications to the offending subroutine, and in any case the issue appears to lie a step removed from FLASH, in the OpenMPI communication protocols. I wanted to share my findings in case anyone else runs into similar issues in the future.
Hope this is helpful!
-Sean
Sean C. Lewis
Doctoral Candidate
Department of Physics
Drexel University