[FLASH-USERS] Changing number of processors on restart and Driver_setupParallelEnv
Klaus Weide
klaus at flash.uchicago.edu
Mon Dec 17 10:55:26 EST 2018
On Sat, 15 Dec 2018, Joshua Wall wrote:
> I'm currently experiencing a hang when calling MPI_Comm_Split in
> Driver_setupParallelEnv on a restart of a checkpoint file (the first call
> to get dr_meshComm). I am changing (increasing actually) the number of
> processors on this restart, since the code overran the maxblocks number
> during its AMR in this run. I suspect this has something to do with this
> crash.
Hello Joshua,
I don't see how this would be related to restarting from a checkpoint that
was generated by a smaller number of tasks. FLASH checkpoints should be
essentially identical regardless of the number of tasks in the run that
dumped them. I would expect you to see the same behavior if you started a
fresh run on the same (increased) number of procs.
> So two questions,
> 1) is there a straightforward way to sort this out? I couldn't find
> anything in the mailing list archive or
> 2) if not, can I safely just set all of the comms to be equal to
> dr_globalComm? I can't seem to find any code via a grep that uses the comms
> set up in this routine (dr_meshComm, dr_meshAcrossComm, dr_axisComm).
Which of the MPI_Comm_split calls is hanging?
Those various additional communicators derived from dr_globalComm
(normally == MPI_COMM_WORLD) in Driver_setupParallelEnv are meant to be
used, and to differ significantly from dr_globalComm, only in two
situations:
1) The Uniform Grid implementation (UG) is used (FLASH_GRID_UG is
defined).
2) Mesh replication is used (dr_meshCopyCount > 1).
Assuming that neither of those conditions applies in your case, it
should be safe to set those communicator variables to dr_globalComm or
perhaps some dummy value.
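For readers unfamiliar with what MPI_Comm_split does here: it is a collective
call over the parent communicator that groups ranks by a "color" argument and
orders each new group by a "key". The grouping itself can be illustrated
without MPI; the sketch below is a pure-Python emulation of that semantics
for a hypothetical mesh-replication layout (the 8-rank / 2-copy numbers are
made up for illustration, not taken from FLASH).

```python
# Pure-Python illustration of MPI_Comm_split's grouping semantics (no MPI
# needed). MPI_Comm_split(parent, color, key, &newcomm) places all ranks
# that pass the same color into one new communicator, ordered by key
# (ties broken by rank order in the parent communicator).

def comm_split(parent_ranks, color_of, key_of):
    """Return {color: [member ranks of that new communicator, in new-rank order]}."""
    groups = {}
    for r in parent_ranks:
        groups.setdefault(color_of(r), []).append(r)
    for members in groups.values():
        members.sort(key=lambda r: (key_of(r), r))
    return groups

# Hypothetical example: 8 global ranks, 2 mesh copies of 4 ranks each
# (i.e. dr_meshCopyCount == 2). Each copy gets its own "mesh"
# communicator; ranks holding the same position across copies share a
# "mesh-across" communicator.
ranks = range(8)
mesh_comms = comm_split(ranks, color_of=lambda r: r // 4, key_of=lambda r: r % 4)
across_comms = comm_split(ranks, color_of=lambda r: r % 4, key_of=lambda r: r // 4)
```

Note that the real MPI_Comm_split is collective: every rank of the parent
communicator must reach the call. A hang there usually means some ranks
never arrived (e.g. they took a different code path or died earlier), which
is worth checking before replacing the communicators.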
Klaus