[FLASH-BUGS] wallclock checkpoint bug
Sean Matt
matt at physics.mcmaster.ca
Wed Feb 5 16:33:57 CST 2003
Hi,
We've been having problems with our large and relatively long
simulations (that is, those that run on tens of processors for several
hours). The simulations all hang (they stop producing any logfile output
but keep using CPU cycles) at a time close to an integer multiple of the
wallclock checkpoint interval (in our case, 3600 seconds).
The last time this happened, we used TotalView to track down the
problem, and we believe it is a bug. It turns out that, when the run
hangs up, some of the processors are waiting on an MPI_Bcast around line
371 of the output subroutine ("/source/io/output.F90"), while the others
are waiting at an MPI_Reduce within the restrict_tree subroutine that is
called from the output subroutine around line 312. That restrict_tree
call is made only when enough wallclock time has elapsed for a checkpoint
file to be written. So the problem is that some of the processors think
it's time to write a checkpoint, and others do not.
We believe that this is most likely caused by the way FLASH checks
the wallclock time. Around line 303,
"dt_checkpoint = MPI_Wtime() - lastWallClockCheckpoint"
is executed by EACH processor. The time elapsed since the last MPI
synchronization will not always be the same on each processor, so in some
cases the following if statement ("if (dt_checkpoint >
wall_clock_checkpoint) then") may be true for some processors but not
for others. We believe this test should be done by one processor
only, and the result should be broadcast. FLASH already does something
similar to our suggestion for checking for the ".dump_restart" file near
the end of the output subroutine.
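
A minimal sketch of the suggested fix, written as a standalone Fortran
program, not as a patch to output.F90. The names dt_checkpoint,
lastWallClockCheckpoint and wall_clock_checkpoint come from the lines
quoted above; my_rank, master_pe and write_checkpoint are hypothetical
names chosen for illustration, and the actual FLASH source may be
organized differently.

```fortran
program wallclock_checkpoint_sketch
  use mpi
  implicit none

  ! Hypothetical names (not taken from the FLASH source):
  integer, parameter :: master_pe = 0      ! rank that makes the decision
  integer :: my_rank, ierr
  logical :: write_checkpoint

  ! Names taken from the lines quoted in the report:
  double precision :: dt_checkpoint, lastWallClockCheckpoint
  double precision, parameter :: wall_clock_checkpoint = 3600.0d0

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
  lastWallClockCheckpoint = MPI_Wtime()

  ! ... time steps would run here; the test below stands in for the
  ! wallclock check made inside the output subroutine ...

  ! Only one processor reads its clock and evaluates the test.
  write_checkpoint = .false.
  if (my_rank == master_pe) then
     dt_checkpoint = MPI_Wtime() - lastWallClockCheckpoint
     write_checkpoint = (dt_checkpoint > wall_clock_checkpoint)
  end if

  ! Every processor receives the same answer, so all of them take the
  ! same branch below.
  call MPI_Bcast(write_checkpoint, 1, MPI_LOGICAL, master_pe, &
                 MPI_COMM_WORLD, ierr)

  if (write_checkpoint) then
     ! Collective work (restrict_tree, writing the checkpoint file)
     ! would go here; all processors reach it together.
     lastWallClockCheckpoint = MPI_Wtime()
  end if

  call MPI_Finalize(ierr)
end program wallclock_checkpoint_sketch
```

Because the decision is made once and then shared, the collective calls
on either side of the if statement match up on every processor, even
when the local clocks disagree by a small amount.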
-Sean