[FLASH-USERS] bug in Driver_evolveFlash.F90 (any implementation); exit on elapsed wall clock time not correctly implemented

Klaus Weide klaus at flash.uchicago.edu
Mon Aug 12 13:35:04 EDT 2013


On Mon, 12 Aug 2013, Christoph Federrath wrote:

> 
> Hi FLASH developers and users,
> 
> I want to report a bug in Driver_evolveFlash.F90 (any implementation). 
> In case the code needs to be stopped when a maximum wall clock time is 
> reached, I had code hang-ups in some cases (in particular when running 
> with thousands of MPI tasks), while in other cases it ran through to the end. 
> Looking into Driver_evolveFlash.F90, it seems that some MPI tasks can 
> actually get past the exit on max wallclock time (if for them the max 
> wall clock had not been reached yet), while others may already exit the 
> evolution loop, producing inconsistent behavior and hang-ups when they 
> try to communicate globally the next time. I fixed this by porting 
> FLASH2.5 code from flash.F90. This needs to be replaced in 
> Driver_evolveFlash.F90:
> 
> 
>     call Driver_getElapsedWCTime(dr_elapsedWCTime)
>     if (dr_elapsedWCTime >  dr_wallClockTimeLimit) then
>        if(dr_globalMe == MASTER_PE) then
>           print *, "exiting: reached max wall clock time"
>        end if
>        exit
>     end if
> 
> 
> with
> 
> 
>     ! Christoph Federrath 2013 replaced below, because of hang-ups, if one process went beyond this point
>     ! (wall clock time not yet reached), while another exits the loop here. Ported from FLASH2.5.
>     call Driver_getElapsedWCTime(dr_elapsedWCTime)
>     endtime = .false.
>     if (dr_globalMe == MASTER_PE) then
>         ! Set the flag and print the message only once the limit is actually exceeded.
>         if (dr_elapsedWCTime >  dr_wallClockTimeLimit) then
>             endtime = .true.
>             print *, "exiting: reached max wall clock time"
>         end if
>     endif
>     call MPI_Bcast(endtime, 1, MPI_LOGICAL, MASTER_PE, MPI_COMM_WORLD, ierr)
>     if (endtime) exit
> 
> 
> and the appropriate include and declarations must be added for this to work:
> 
> 
>  include "Flash_mpi.h"
>  logical :: endtime
>  integer :: ierr
> 
> 
> I hope this helps. Please confirm that this is a bug in FLASH4.0.1 (latest public release version).

Hi Christoph,

I agree with your analysis, although I am surprised that you would have 
encountered a problem caused by wall clock timers getting out of synch 
several times - I would have thought this unlikely.

Your fix should work.  It appears to use the same logic that is already 
applied in IO_output.F90, where elapsed time is used in a condition for 
dropping a checkpoint.
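
For illustration, the pattern discussed above (one task decides, and every task acts on the broadcast decision) can be sketched as a small standalone MPI program. This is only a sketch under stated assumptions, not FLASH code: the program name, master_pe, time_limit, and the barrier standing in for an evolution step are all hypothetical, and it uses MPI_Wtime in place of Driver_getElapsedWCTime.

```fortran
! Minimal sketch, not FLASH code: all ranks leave the loop on the same
! iteration because only the master's clock is consulted.
! Hypothetical names: wallclock_exit_demo, master_pe, time_limit.
program wallclock_exit_demo
  implicit none
  include "mpif.h"

  integer, parameter :: master_pe = 0
  double precision, parameter :: time_limit = 2.0d0  ! seconds, arbitrary
  double precision :: t_start, elapsed
  logical :: endtime
  integer :: ierr, my_rank, step

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
  t_start = MPI_Wtime()

  step = 0
  do
     step = step + 1

     ! Stand-in for one evolution step that contains collective communication.
     call MPI_Barrier(MPI_COMM_WORLD, ierr)

     ! Only the master compares elapsed time against the limit.
     endtime = .false.
     if (my_rank == master_pe) then
        elapsed = MPI_Wtime() - t_start
        if (elapsed > time_limit) then
           endtime = .true.
           print *, "exiting: reached max wall clock time at step", step
        end if
     end if

     ! Every rank receives the master's decision, so either all exit or none do.
     call MPI_Bcast(endtime, 1, MPI_LOGICAL, master_pe, MPI_COMM_WORLD, ierr)
     if (endtime) exit
  end do

  call MPI_Finalize(ierr)
end program wallclock_exit_demo
```

With a typical MPI installation this would be built with the compiler wrapper (often mpif90) and started on a few tasks. If each rank instead tested its own clock and exited on its own, a rank that left the loop early would leave the others waiting in the next collective call, which is the hang described in the report.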

We will include a fix like yours in the next FLASH release.

Thank you for the report!

Klaus


