[FLASH-BUGS] wallclock checkpoint bug
Mike Zingale
zingale at flash.uchicago.edu
Wed Feb 5 18:13:39 CST 2003
Sean, I think you are right about this one. I don't believe that any of
us has run into this before, but it could be a clock synchronization
issue. In any case, I'll change this to do the time computation on the
master processor shortly.
Mike
------------------------------------------------------------------------------
Michael Zingale
UCO/Lick Observatory
UCSC
Santa Cruz, CA 95064
phone: (831) 459-5246
fax: (831) 459-5265
e-mail: zingale at ucolick.org
web: http://www.ucolick.org/~zingale
``What an awful dream -- ones and zeros everywhere. I thought I saw a two''
-- Bender
On Wed, 5 Feb 2003, Sean Matt wrote:
> Hi,
>
> We've been having problems with our large and relatively long
> simulations (that is, those that run on tens of processors for several
> hours). The simulations all hang (they stop producing any logfile output
> but keep using CPU cycles) at some time that is near an integer multiple
> of the wallclock checkpoint time (in our case, 3600 seconds).
> The last time this happened, we used TotalView to track down the
> problem, and we believe it is a bug. It turns out that, when the run
> hangs up, some of the processors are waiting on an MPI_Bcast around line
> 371 of the output subroutine ("/source/io/output.F90"), while the others
> are waiting at an MPI_Reduce within the restrict_tree subroutine that is
> called from the output subroutine around line 312. The restrict_tree in
> question is called when the wallclock time is right for a checkpoint file
> to be written. So the problem is that some of the processors think it's
> time to write, and others do not.
> We believe that this is most likely caused by the way FLASH checks
> the wallclock time. Around line 303,
>
> "dt_checkpoint = MPI_Wtime() - lastWallClockCheckpoint"
>
> is executed by EACH processor. The time elapsed since the last MPI
> synchronization will not always be the same for each processor. In some
> cases, the following if statement ("if (dt_checkpoint >
> wall_clock_checkpoint) then") may be true for some processors, but not
> others. We believe this if statement should be done by one processor
> only, and the result should be broadcast. FLASH already does something
> similar to our suggestion for checking for the ".dump_restart" file near
> the end of the output subroutine.
>
>
> -Sean
>
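The failure mode and the proposed fix described above (one processor
evaluates the wallclock test and the result is broadcast) can be
illustrated with a small Python toy model. This is only a sketch under
stated assumptions, not FLASH code: no MPI is involved, the per-rank
elapsed times are made-up values standing in for
MPI_Wtime() - lastWallClockCheckpoint on each processor, and the function
names local_decisions and master_decision are hypothetical. Copying the
master's answer to every rank stands in for an MPI_Bcast.

```python
# Toy model only (not FLASH code, no MPI): shows why a per-rank wallclock
# test can disagree across ranks, and why a master-only test cannot.

def local_decisions(elapsed_per_rank, wall_clock_checkpoint):
    """Each rank tests its own elapsed time, as in the reported code path."""
    return [dt > wall_clock_checkpoint for dt in elapsed_per_rank]


def master_decision(elapsed_per_rank, wall_clock_checkpoint, master=0):
    """Only the master rank tests; copying its answer to all ranks
    stands in for broadcasting the result."""
    decision = elapsed_per_rank[master] > wall_clock_checkpoint
    return [decision] * len(elapsed_per_rank)


wall_clock_checkpoint = 3600.0

# Hypothetical per-rank values of MPI_Wtime() - lastWallClockCheckpoint,
# differing by a few hundredths of a second near the threshold.
elapsed = [3600.02, 3599.98, 3600.01, 3599.99]

per_rank = local_decisions(elapsed, wall_clock_checkpoint)
agreed = master_decision(elapsed, wall_clock_checkpoint)

print("per-rank test:          ", per_rank)
print("master test + broadcast:", agreed)
```

With these sample values the per-rank test splits the processors into two
groups, which is the condition for the hang described above: one group
enters the checkpoint path while the other does not. The master-only test
gives every rank the same answer by construction.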