[FLASH-USERS] bug in Driver_evolveFlash.F90 (any implementation); exit on elapsed wall clock time not correctly implemented

Christoph Federrath christoph.federrath at monash.edu
Mon Aug 12 18:13:39 EDT 2013


Hi Klaus,

thanks for confirming the bug. I think the problem is not that the wall clock timer is out of sync on different MPI tasks. Rather, the MPI tasks do not reach each point in the evolution loop at exactly the same time (unless an MPI_Barrier or a collective MPI call is included). Some tasks pass the check before the limit is reached and start the next loop iteration, while others reach the check later (when the wall clock limit has actually been reached) and exit the loop. As a result, some MPI tasks are in the new loop iteration (with incremented loop counter, step counter, etc.) while others have already left the loop. This produces inconsistent behavior; in particular, not all processes reach the subsequent collective calls (e.g., in output), so the code hangs. The code below simply gets all MPI tasks in sync at the point where they should be in sync, and only the master MPI task decides that the code should stop. Note that this is how FLASH2.5 handles exit on elapsed wall clock time, so I ported it from there.
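
For anyone who wants to try the pattern outside FLASH, here is a minimal standalone sketch. It is not FLASH code: the program name, master_pe, wall_clock_limit, max_steps and the do_work routine are all made up for illustration, and MPI_Wtime stands in for Driver_getElapsedWCTime. Only the master's clock is consulted, and its decision is broadcast, so every rank leaves the loop in the same iteration.

```fortran
program wallclock_exit_demo
  use mpi
  implicit none
  ! All names and values below are illustrative assumptions, not FLASH variables.
  integer, parameter :: master_pe = 0
  integer, parameter :: max_steps = 1000
  double precision, parameter :: wall_clock_limit = 2.0d0   ! seconds
  integer :: ierr, my_rank, step
  double precision :: t_start, elapsed
  logical :: endtime

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
  t_start = MPI_Wtime()

  do step = 1, max_steps
     call do_work(my_rank)          ! placeholder for one evolution step

     ! Only the master compares its elapsed time against the limit.
     endtime = .false.
     if (my_rank == master_pe) then
        elapsed = MPI_Wtime() - t_start
        if (elapsed > wall_clock_limit) then
           endtime = .true.
           print *, "exiting: reached max wall clock time at step", step
        end if
     end if

     ! Every rank takes part in the broadcast and receives the same answer,
     ! so either all ranks exit here or none do.
     call MPI_Bcast(endtime, 1, MPI_LOGICAL, master_pe, MPI_COMM_WORLD, ierr)
     if (endtime) exit
  end do

  call MPI_Finalize(ierr)

contains

  ! Busy-wait for a rank-dependent time to mimic load imbalance between tasks.
  subroutine do_work(rank)
    integer, intent(in) :: rank
    double precision :: t0
    t0 = MPI_Wtime()
    do while (MPI_Wtime() - t0 < 0.01d0 * (1 + mod(rank, 3)))
    end do
  end subroutine do_work

end program wallclock_exit_demo
```

Built with an MPI Fortran compiler wrapper and run on a few tasks, this should stop after roughly two seconds with a single message from the master. Replacing the broadcast with a purely local test on each rank's own elapsed time reproduces the situation described above, where ranks can disagree about the iteration in which to exit.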

I received some concerns about runtime issues because of this global communication call. However, I don't think that this single call per step will make any noticeable difference to the performance. An improved implementation would be to make the collective MPI call only when all processes get *close* to reaching the maximum wall clock time. How close, however, depends on the performance and the simulation, i.e., how long each timestep takes, the machine the code is running on, etc., so the present solution seems to be the most general (if not the most efficient) fix to the problem.
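
To make the "only when close" idea a bit more concrete, here is a rough sketch of one possible schedule. It is my own illustration, not something taken from FLASH, and the module name, function name and the safety parameter are hypothetical. The one point to be careful about is that the decision of *when* to do the collective call must itself be identical on all tasks, otherwise the same hang comes back. The sketch therefore computes the next check step only from the step counter and from the master's elapsed time, assuming the latter has been broadcast to all tasks.

```fortran
module wc_check_schedule
  implicit none
contains

  ! Return the step at which all tasks should next take part in the
  ! collective wall clock check. Inputs must be identical on all tasks:
  !   step    - current step counter (shared by construction)
  !   elapsed - master's elapsed wall clock time, as received via broadcast
  !   limit   - wall clock limit in the same units as elapsed
  !   safety  - integer >= 1; larger values mean more frequent checks
  pure function next_check_step(step, elapsed, limit, safety) result(next)
    integer, intent(in) :: step, safety
    double precision, intent(in) :: elapsed, limit
    integer :: next
    double precision :: per_step, steps_left

    ! Average cost of a step so far, as seen by the master.
    per_step = elapsed / dble(max(step, 1))

    if (per_step <= 0.0d0 .or. elapsed >= limit) then
       next = step + 1                       ! no estimate yet: check every step
    else
       ! Estimated number of steps that still fit before the limit,
       ! capped to keep the integer conversion safe.
       steps_left = min((limit - elapsed) / per_step, 1.0d6)
       ! Skip ahead by a fraction of the remaining steps, at least one.
       next = step + max(1, int(steps_left) / max(safety, 1))
    end if
  end function next_check_step

end module wc_check_schedule

program schedule_demo
  use wc_check_schedule
  implicit none
  ! Far from the limit (1 s per step, 3500 s left): long gap until next check.
  print *, next_check_step(100, 100.0d0, 3600.0d0, 4)     ! 975
  ! Close to the limit (10 s left): check again after two steps.
  print *, next_check_step(3590, 3590.0d0, 3600.0d0, 4)   ! 3592
end program schedule_demo
```

In the evolution loop, every task would then join the broadcast only when its step counter equals the stored next check step, and recompute that step from the broadcast time afterwards. The checks become denser as the limit approaches, but a single unusually long step can still overshoot the limit, which is one reason the unconditional broadcast remains the safer choice.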

Kind regards,

Christoph

________________________________
Dr. Christoph Federrath
School of Mathematical Sciences,
Monash University,
Clayton, Vic 3800, Australia
+61 3 9905 9760
http://www.ita.uni-heidelberg.de/~chfeder/index.shtml?lang=en


On 13.08.2013 at 03:35, Klaus Weide wrote:

> On Mon, 12 Aug 2013, Christoph Federrath wrote:
> 
>> 
>> Hi FLASH developers and users,
>> 
>> I want to report a bug in Driver_evolveFlash.F90 (any implementation). 
>> In case the code needs to be stopped when a maximum wall clock time is 
>> reached, I had code hang-ups in some cases (in particular when running 
>> with thousands of MPI tasks) and in some cases it went through to the end. 
>> Looking into Driver_evolveFlash.F90, it seems that some MPI tasks can 
>> actually get past the exit on max wallclock time (if for them the max 
>> wall clock had not been reached yet), while others may already exit the 
>> evolution loop, producing inconsistent behavior and hang-ups when they 
>> try to communicate globally the next time. I fixed this by porting 
>> FLASH2.5 code from flash.F90. This needs to be replaced in 
>> Driver_evolveFlash.F90:
>> 
>> 
>>    call Driver_getElapsedWCTime(dr_elapsedWCTime)
>>    if (dr_elapsedWCTime >  dr_wallClockTimeLimit) then
>>       if(dr_globalMe == MASTER_PE) then
>>          print *, "exiting: reached max wall clock time"
>>       end if
>>       exit
>>    end if
>> 
>> 
>> with
>> 
>> 
>>    ! Christoph Federrath 2013 replaced below, because of hang-ups, if one process went beyond this point
>>    ! (wall clock time not yet reached), while another exits the loop here. Ported from FLASH2.5.
>>    call Driver_getElapsedWCTime(dr_elapsedWCTime)
>>    endtime = .false.
>>    if (dr_globalMe == MASTER_PE) then
>>        ! print the message only when the limit has actually been reached
>>        if (dr_elapsedWCTime >  dr_wallClockTimeLimit) then
>>           endtime = .true.
>>           print *, "exiting: reached max wall clock time"
>>        endif
>>    endif
>>    call MPI_Bcast(endtime, 1, MPI_LOGICAL, MASTER_PE, MPI_COMM_WORLD, ierr)
>>    if (endtime) exit
>> 
>> 
>> and the appropriate include and declarations must be added for this to work:
>> 
>> 
>> include "Flash_mpi.h"
>> logical :: endtime
>> integer :: ierr
>> 
>> 
>> I hope this helps. Please confirm that this is a bug in FLASH4.0.1 (latest public release version).
> 
> Hi Christoph,
> 
> I agree with your analysis, although I am surprised that you would have 
> encountered a problem caused by wall clock timers getting out of synch 
> several times - I would have thought this unlikely.
> 
> Your fix should work.  It appears to use the same logic that is already 
> applied in IO_output.F90, where elapsed time is used in a condition for 
> dropping a checkpoint.
> 
> We will include a fix like yours in the next FLASH release.
> 
> Thank you for the report!
> 
> Klaus



