[FLASH-USERS] Changing number of processors on restart and Driver_setupParallelEnv
Klaus Weide
klaus at flash.uchicago.edu
Mon Dec 17 10:55:26 EST 2018
On Sat, 15 Dec 2018, Joshua Wall wrote:
> I'm currently experiencing a hang when calling MPI_Comm_Split in
> Driver_setupParallelEnv on a restart of a checkpoint file (the first call
> to get dr_meshComm). I am changing (increasing actually) the number of
> processors on this restart, since the code overran the maxblocks number
> during its AMR in this run. I suspect this has something to do with this
> crash.
Hello Joshua,
I don't see how this would be related to restarting from a checkpoint that
was generated by a smaller number of tasks. FLASH checkpoints should be
essentially identical regardless of the number of tasks in the run that
dumped them. I would expect you to see the same behavior if you started a
fresh run on the same (increased) number of procs.
> So two questions,
> 1) is there a straightforward way to sort this out? I couldn't find
> anything in the mailing list archive or
> 2) if not, can I safely just set all of the comms to be equal to
> dr_globalComm? I can't seem to find any code via a grep that uses the comms
> set up in this routine (dr_meshComm, dr_meshAcrossComm, dr_axisComm).
Which of the MPI_Comm_split calls is hanging?
Those various additional communicators derived from dr_globalComm
(normally == MPI_COMM_WORLD) in Driver_setupParallelEnv are meant to be
used, and to differ significantly from dr_globalComm, only in two
situations:
1) The Uniform Grid implementation (UG) is used (FLASH_GRID_UG is
defined).
2) Mesh replication is used (dr_meshCopyCount > 1).
Assuming that neither of those conditions applies in your case, it
should be safe to set those communicator variables to dr_globalComm or
perhaps some dummy value.
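For readers unfamiliar with what MPI_Comm_split does here: it is a collective
call over the parent communicator that groups ranks by a "color" argument and
orders each new group by a "key". The grouping itself can be illustrated
without MPI; the sketch below is a pure-Python emulation of that semantics
for a hypothetical mesh-replication layout (the 8-rank / 2-copy numbers are
made up for illustration, not taken from FLASH).

```python
# Pure-Python illustration of MPI_Comm_split's grouping semantics (no MPI
# needed). MPI_Comm_split(parent, color, key, &newcomm) places all ranks
# that pass the same color into one new communicator, ordered by key
# (ties broken by rank order in the parent communicator).

def comm_split(parent_ranks, color_of, key_of):
    """Return {color: [member ranks of that new communicator, in new-rank order]}."""
    groups = {}
    for r in parent_ranks:
        groups.setdefault(color_of(r), []).append(r)
    for members in groups.values():
        members.sort(key=lambda r: (key_of(r), r))
    return groups

# Hypothetical example: 8 global ranks, 2 mesh copies of 4 ranks each
# (i.e. dr_meshCopyCount == 2). Each copy gets its own "mesh"
# communicator; ranks holding the same position across copies share a
# "mesh-across" communicator.
ranks = range(8)
mesh_comms = comm_split(ranks, color_of=lambda r: r // 4, key_of=lambda r: r % 4)
across_comms = comm_split(ranks, color_of=lambda r: r % 4, key_of=lambda r: r // 4)
```

Note that the real MPI_Comm_split is collective: every rank of the parent
communicator must reach the call. A hang there usually means some ranks
never arrived (e.g. they took a different code path or died earlier), which
is worth checking before replacing the communicators.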
Klaus