[FLASH-USERS] issues with IBM MPI and hypre

Klaus Weide klaus at flash.uchicago.edu
Thu Apr 18 12:18:39 EDT 2013


On Thu, 18 Apr 2013, Roman Yurchak wrote:

>   I'm trying to run a simulation derived from LaserSlab at
> ada.idris.fr (Intel CPUs with IBM's Parallel Operating Environment, poe),
> and it frequently hangs forever at a seemingly random time step due to
> some MPI communication errors within hypre (with no error messages
> printed). The same simulation runs perfectly well with ifort/icc and
> Open MPI on another machine.
> 
>   Now, on ada.idris.fr there are ifort/icc 12.1.0, IBM MPI, hypre
> 2.9.0b and I'm using the svn version of FLASH from Oct 12 (I should
> probably update to FLASH 4.0.1). See log file
> http://perso.crans.org/yurchak/i/flash.log.txt for compilation flags and
> setup arguments. A few remarks:
>    * the failures are reproducible: for a given setup, the simulation
> always hangs on the same time step.
>    * changing hydro solver parameters, cfl, etc. seems to only change the
> time step at which it happens.
>    * tried to recompile FLASH and hypre with -O0 without much success
>    * some debugging shows that when it happens, the processes are
> approximately in the following state:
>         LapiImpl::Context:Advance
>         MPIC_Wait()
>         MPIR_Allreduce_intra()
>         hypre_GMRESSetup()
>         diff_advancetherm()
>   (see http://perso.crans.org/yurchak/i/debug_tv_hypre.png for a more
> complete snapshot of one of the processes with totalview)

Hi Roman!

1.) Yes, you should definitely use a newer version of FLASH. In
particular, there have been changes in how we call HYPRE for solving
the radiation (and heat conduction) problem, including an important
bug fix. This bug fix is included in the FLASH 4.0.1 patch available
from <URL: http://flash.uchicago.edu/site/flashcode/user_support/> and
in the updated FLASH 4.0.1 tarball available at the usual place.
However, since you have access to the svn version, you should just
update your working copy of the flash4 branch. When you do so, update
(at least) to r18826.

After updating your FLASH code, you should be able to use 
   gr_hypresolvertype          = HYPRE_PCG
(the default) instead of 
   gr_hypresolvertype          = HYPRE_GMRES
since the matrices in the problems solved by HYPRE will be more
exactly symmetric.


2.) The above probably does not address your immediate issue.
However, we have seen similar problems when using HYPRE 2.9.0b that
did not occur with earlier versions of HYPRE. I reported this to the
HYPRE team in January; here is part of the response I received:

==================================================================
Subject: [issue1031] BoomerAMG communicator leak with HYPRE 2.9.0b

Ulrike Yang added the comment:

Hi Klaus,
I believe that this is actually a problem with certain MPI
implementations not allowing enough communicators, i.e. even freed
communicators get counted. We tried to improve some part of the AMG
code, and theoretically that should have worked, but unfortunately ran
into this MPI problem. We realized this only after the release was
already out. I assume you are using Gaussian elimination on the
coarsest grid. If you instead use a few sweeps of a smoother on the
coarsest grid (via HYPRE_BoomerAMGSetCycleRelaxType(.,.,3) and
HYPRE_BoomerAMGSetCycleNumSweeps(.,.,3)), I would expect the problem
to go away.
==================================================================

I have tried the workaround described above, and found that it successfully
eliminated the problem that showed up in MPI.  So far we have not put that
workaround in the FLASH code (neither in 4.0.1 nor in the svn flash4
branch), since it will slightly change solutions, and we have not
sufficiently tested whether there are any undesirable effects. (Also, this
is just a workaround, necessary only on some platforms with a specific
HYPRE version, and we mostly use earlier versions.)
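
For reference, here is a minimal sketch of what the two calls named in the
HYPRE team's reply might look like in C. The third argument, 3, selects the
coarsest level in HYPRE's cycle settings. The reply does not say which
relaxation type or how many sweeps to use, so both are left as parameters.
The helper name use_smoother_on_coarsest_grid and the stub_solver stand-ins
are assumptions made for illustration only; the stand-ins exist so that the
sketch compiles without HYPRE, and building with -DUSE_HYPRE uses the real
header instead.

```c
/* Sketch only: replace the coarsest-grid solve in BoomerAMG by a few
 * smoother sweeps. Compile with -DUSE_HYPRE to use the real library;
 * otherwise the stand-ins below (NOT part of HYPRE) just record the
 * arguments so that the file compiles on its own. */
#include <assert.h>

#ifdef USE_HYPRE
#include "HYPRE_parcsr_ls.h"
#else
typedef int HYPRE_Int;
typedef struct {
    HYPRE_Int relax_type[4];  /* indexed by k = 1..3 */
    HYPRE_Int num_sweeps[4];  /* indexed by k = 1..3 */
} stub_solver;                /* hypothetical stand-in type */
typedef stub_solver *HYPRE_Solver;

static HYPRE_Int HYPRE_BoomerAMGSetCycleRelaxType(HYPRE_Solver s,
                                                  HYPRE_Int relax_type,
                                                  HYPRE_Int k)
{
    assert(k >= 1 && k <= 3);
    s->relax_type[k] = relax_type;
    return 0;
}

static HYPRE_Int HYPRE_BoomerAMGSetCycleNumSweeps(HYPRE_Solver s,
                                                  HYPRE_Int num_sweeps,
                                                  HYPRE_Int k)
{
    assert(k >= 1 && k <= 3);
    s->num_sweeps[k] = num_sweeps;
    return 0;
}
#endif

/* In HYPRE's cycle settings, k = 3 addresses the coarsest level
 * (k = 1 is the down cycle, k = 2 the up cycle). */
enum { COARSEST_LEVEL = 3 };

/* Hypothetical helper: call after the BoomerAMG solver has been created
 * and before its setup. Returns HYPRE's error code (0 on success). */
HYPRE_Int use_smoother_on_coarsest_grid(HYPRE_Solver amg,
                                        HYPRE_Int relax_type,
                                        HYPRE_Int num_sweeps)
{
    HYPRE_Int ierr;

    ierr = HYPRE_BoomerAMGSetCycleRelaxType(amg, relax_type, COARSEST_LEVEL);
    if (ierr != 0) {
        return ierr;
    }
    return HYPRE_BoomerAMGSetCycleNumSweeps(amg, num_sweeps, COARSEST_LEVEL);
}
```

FLASH drives HYPRE from Fortran, so in the FLASH source the corresponding
Fortran-interface calls would be needed; the C form is shown here only
because it matches the function names used in the reply.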

My suggestions to you (in addition to 1.) above!):
a) Verify that your problem disappears with HYPRE 2.7.0b or 2.8.0b.
   If yes:
   b) Use the earlier HYPRE version,
      OR
   c) Implement the workaround.

If you choose c), contact me if you need details on where to put those 
calls.


Klaus


