[FLASH-BUGS] wallclock checkpoint bug

Mike Zingale zingale at flash.uchicago.edu
Wed Feb 5 18:13:39 CST 2003


Sean, I think you are right about this one.  I don't believe that any of
us have run into this before, but it could be a clock synchronization
issue.  In any case, I'll change this to do the time computation on the
master processor shortly.
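
For reference, here is a minimal sketch of what a master-only check could
look like.  It is not the actual FLASH change: master_pe, checkpoint_due,
last_checkpoint, interval and write_now are made-up names, and only the MPI
calls and the dt_checkpoint comparison come from the report quoted below.
The idea is that only the master reads the clock and evaluates the test,
then broadcasts the logical result so that every processor takes the same
branch:

```fortran
! Sketch only, not FLASH source.  All names except the MPI_* routines and
! dt_checkpoint are hypothetical.
program wallclock_sketch
  use mpi
  implicit none

  integer, parameter :: master_pe = 0      ! assumed rank of the master
  integer :: my_pe, ierr
  double precision :: last_checkpoint, interval
  logical :: write_now

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_pe, ierr)

  interval = 3600.0d0                      ! seconds, as in the report
  last_checkpoint = MPI_Wtime()

  ! In a real run this would be called once per time step.
  call checkpoint_due(last_checkpoint, interval, my_pe, write_now)

  ! Every rank now holds the same value of write_now, so either all of
  ! them enter the collective checkpoint code or none of them do.
  print *, 'rank', my_pe, 'write_now =', write_now

  call MPI_Finalize(ierr)

contains

  subroutine checkpoint_due(last_checkpoint, interval, my_pe, write_now)
    double precision, intent(inout) :: last_checkpoint
    double precision, intent(in)    :: interval
    integer, intent(in)             :: my_pe
    logical, intent(out)            :: write_now
    double precision :: dt_checkpoint
    integer :: ierr

    write_now = .false.

    ! Only the master consults its clock, so skew between nodes cannot
    ! produce different answers on different ranks.
    if (my_pe == master_pe) then
       dt_checkpoint = MPI_Wtime() - last_checkpoint
       write_now = (dt_checkpoint > interval)
       if (write_now) last_checkpoint = MPI_Wtime()
    end if

    ! Share the master's decision with every rank.
    call MPI_Bcast(write_now, 1, MPI_LOGICAL, master_pe, &
                   MPI_COMM_WORLD, ierr)
  end subroutine checkpoint_due

end program wallclock_sketch
```

Because the decision comes from a single clock, the collective calls that
follow (the MPI_Reduce inside restrict_tree and the later MPI_Bcast) stay
matched across all processors.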

Mike

------------------------------------------------------------------------------
Michael Zingale
UCO/Lick Observatory
UCSC
Santa Cruz, CA 95064

phone:  (831) 459-5246 
fax:    (831) 459-5265
e-mail: zingale at ucolick.org
web:    http://www.ucolick.org/~zingale

``What an awful dream -- ones and zeros everywhere.  I thought I saw a two''
   -- Bender






On Wed, 5 Feb 2003, Sean Matt wrote:

> Hi,
> 
> 	We've been having problems with our large and relatively long
> simulations (that is, those that run on tens of processors for several
> hours).  The simulations all hang (they stop producing any logfile output
> but keep using CPU cycles) at a time near an integer multiple of the
> wallclock checkpoint time (in our case, 3600 seconds).
> 	The last time this happened, we used TotalView to track down the
> problem, and we believe it is a bug.  It turns out that, when the run
> hangs up, some of the processors are waiting on an MPI_Bcast around line
> 371 of the output subroutine ("/source/io/output.F90"), while the others
> are waiting at an MPI_Reduce within the restrict_tree subroutine that is
> called from the output subroutine around line 312.  The restrict_tree in
> question is called when the wallclock time is right for a checkpoint file
> to be written.  So the problem is that some of the processors think it's
> time to write, and others do not.
> 	We believe that this is most likely caused by the way FLASH checks 
> the wallclock time.  Around line 303,
> 
> "dt_checkpoint = MPI_Wtime() - lastWallClockCheckpoint"
> 
> is executed by EACH processor.  The time elapsed since the last MPI
> synchronization will not always be the same for each processor.  In some
> cases, the following if statement ("if (dt_checkpoint >
> wall_clock_checkpoint) then") may be true for some processors, but not
> others.  We believe this if statement should be evaluated by one processor
> only, and the result should be broadcast.  FLASH already does something
> similar to our suggestion for checking for the ".dump_restart" file near
> the end of the output subroutine.
> 
> 
> 		-Sean
> 



