[FLASH-USERS] bug in Driver_evolveFlash.F90 (any implementation); exit on elapsed wall clock time not correctly implemented
Klaus Weide
klaus at flash.uchicago.edu
Mon Aug 12 13:35:04 EDT 2013
On Mon, 12 Aug 2013, Christoph Federrath wrote:
>
> Hi FLASH developers and users,
>
> I want to report a bug in Driver_evolveFlash.F90 (any implementation).
> When the code has to stop because the maximum wall clock time has been
> reached, my runs hung in some cases (in particular when running with
> thousands of MPI tasks) and in other cases ran through to the end.
> Looking into Driver_evolveFlash.F90, it seems that some MPI tasks can
> get past the exit on max wall clock time (if for them the limit had
> not been reached yet), while others already exit the evolution loop.
> This produces inconsistent behavior and hang-ups the next time the
> tasks try to communicate globally. I fixed this by porting FLASH2.5
> code from flash.F90. The following code in Driver_evolveFlash.F90
> needs to be replaced:
>
>
> call Driver_getElapsedWCTime(dr_elapsedWCTime)
> if (dr_elapsedWCTime > dr_wallClockTimeLimit) then
>    if (dr_globalMe == MASTER_PE) then
>       print *, "exiting: reached max wall clock time"
>    end if
>    exit
> end if
>
>
> with
>
>
> ! Christoph Federrath 2013 replaced below, because of hang-ups, if one process went beyond this point
> ! (wall clock time not yet reached), while another exits the loop here. Ported from FLASH2.5.
> call Driver_getElapsedWCTime(dr_elapsedWCTime)
> endtime = .false.
> if (dr_globalMe == MASTER_PE) then
>    ! Only the master decides, and it prints only when the limit is exceeded.
>    if (dr_elapsedWCTime > dr_wallClockTimeLimit) then
>       endtime = .true.
>       print *, "exiting: reached max wall clock time"
>    end if
> end if
> ! Every task receives the master's decision, so all exit in the same step.
> call MPI_Bcast(endtime, 1, MPI_LOGICAL, MASTER_PE, MPI_COMM_WORLD, ierr)
> if (endtime) exit
>
>
> The following include and declarations must also be added for this to work:
>
>
> include "Flash_mpi.h"
> logical :: endtime
> integer :: ierr
>
>
> I hope this helps. Please confirm that this is a bug in FLASH4.0.1 (latest public release version).
Hi Christoph,
I agree with your analysis, although I am surprised that you ran into
a problem caused by wall clock timers getting out of sync several
times; I would have thought this unlikely.
Your fix should work. It appears to use the same logic that is already
applied in IO_output.F90, where elapsed time is used in a condition for
dropping a checkpoint.
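For reference, the pattern can be sketched as a small standalone program.
This is only an illustration under stated assumptions, not FLASH source:
the program name, time_limit and max_steps are made up for the example,
and MPI_Wtime stands in for Driver_getElapsedWCTime.

```fortran
! Standalone sketch (not FLASH code): the master rank alone decides
! whether the wall clock limit is exceeded and broadcasts that decision,
! so every rank leaves the loop in the same iteration.
program wallclock_exit_sketch
  use mpi
  implicit none

  integer, parameter :: MASTER_PE = 0                 ! same role as in FLASH
  integer, parameter :: max_steps = 100000000         ! hypothetical step cap
  double precision, parameter :: time_limit = 0.5d0   ! seconds, hypothetical

  integer :: ierr, my_rank, step
  double precision :: t_start, elapsed
  logical :: endtime

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
  t_start = MPI_Wtime()

  do step = 1, max_steps
     ! ... one evolution step with global communication would go here ...

     ! Each rank measures its own elapsed time; the values can differ
     ! slightly from rank to rank.
     elapsed = MPI_Wtime() - t_start

     ! Only the master's measurement is used for the decision.
     endtime = .false.
     if (my_rank == MASTER_PE) then
        if (elapsed > time_limit) then
           endtime = .true.
           print *, "exiting: reached max wall clock time at step", step
        end if
     end if

     ! After the broadcast all ranks hold the same value of endtime.
     call MPI_Bcast(endtime, 1, MPI_LOGICAL, MASTER_PE, MPI_COMM_WORLD, ierr)
     if (endtime) exit
  end do

  call MPI_Finalize(ierr)
end program wallclock_exit_sketch
```

The cost is one extra broadcast per step. The decision rests on the
master's clock only, so the other ranks' timers no longer matter for
the exit condition.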
We will include a fix like yours in the next FLASH release.
Thank you for the report!
Klaus