[FLASH-USERS] 4000+ cpus on Franklin
James Guillochon
jfg at ucolick.org
Tue Sep 1 13:31:57 EDT 2009
Hi Klaus,
I am using FLASH 3.0. The problem occurs immediately after restart,
before the first time step. Here are the last log file entries before the abort:
[ 08-31-2009 21:56:16.578 ] [GRID amr_refine_derefine]: initiating refinement
[GRID amr_refine_derefine] min blks 17 max blks 21 tot blks 76393
[GRID amr_refine_derefine] min leaf blks 13 max leaf blks 17 tot leaf blks 66844
[ 08-31-2009 21:56:16.655 ] [GRID amr_refine_derefine]: refinement complete
[DRIVER_ABORT] Driver_abort() called by PE 4093
abort_message [flash_convert_cc_hook] Trying to convert non-zero mass-specific variable to per-volume form, but dens is zero!
Here's the standard output:
file: wdacc_hdf5_chk_00170 opened for restart
read_data: read 76393 blocks.
io_readData: finished reading input file.
[Eos_init] Cannot open helm_table.bdat!
[Eos_init] Trying old helm_table.dat!
Source terms initialized
don_dist, don_mass 3359870241.058920 6.5677519229596569E+032
[EOS Helmholtz] WARNING! Mask setting does not speed up Eos Helmholtz calls
iteration, no. not moved = 0 76246
iteration, no. not moved = 1 42904
iteration, no. not moved = 2 0
refined: total leaf blocks = 66844
refined: total blocks = 76393
[flash_convert_cc_hook] PE= 4093, ivar= 4, why=2
Trying to convert non-zero mass-specific variable to per-volume form, but dens is zero!
Application 1231431 exit codes: 1
Application 1231431 exit signals: Killed
Application 1231431 resources: utime 0, stime 0
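
For readers following the thread, here is a rough sketch of the kind of check the abort message describes. It is written in Python for illustration only and is not the FLASH Fortran source: the function names `to_per_volume` and `to_per_mass` are hypothetical, and the assumption (taken from the wording of the message) is that a mass-specific quantity is multiplied by the density to get its per-volume form, which cannot be undone if the density is zero while the quantity is not.

```python
def to_per_volume(dens, specific):
    """Convert per-mass values to per-volume values, cell by cell.

    Hypothetical illustration: raises if a cell has zero density but a
    non-zero mass-specific value, since value * 0 would discard the value.
    """
    out = []
    for i, (rho, q) in enumerate(zip(dens, specific)):
        if rho == 0.0 and q != 0.0:
            raise ValueError(
                f"cell {i}: non-zero mass-specific value {q} but dens is zero"
            )
        out.append(rho * q)
    return out


def to_per_mass(dens, per_volume):
    """Inverse conversion; cells with zero density are returned as zero."""
    return [qv / rho if rho != 0.0 else 0.0
            for rho, qv in zip(dens, per_volume)]


# A well-behaved pair of cells converts and converts back.
dens = [2.0, 4.0]
velx = [3.0, 0.5]
assert to_per_mass(dens, to_per_volume(dens, velx)) == velx
```

In this picture, a single cell with zero density and a non-zero mass-specific value is enough to trigger the abort on whichever processor owns that cell.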
The error appears to occur on only one processor, and it is one of the
highest-numbered ones: PE 4093 out of 4096 processors in total.
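
One way to confirm that only a single PE reports the problem is to scan the saved standard output for the `[flash_convert_cc_hook]` lines. The following is a minimal sketch in Python, assuming the lines have exactly the `PE= ..., ivar= ..., why=...` layout shown above; the helper name `failing_pes` is hypothetical.

```python
import re

# Matches lines such as: "[flash_convert_cc_hook] PE= 4093, ivar= 4, why=2"
HOOK_LINE = re.compile(
    r"\[flash_convert_cc_hook\]\s*PE=\s*(\d+),\s*ivar=\s*(\d+),\s*why=\s*(\d+)"
)


def failing_pes(stdout_text):
    """Map each reporting PE to a sorted list of (ivar, why) pairs."""
    found = {}
    for line in stdout_text.splitlines():
        m = HOOK_LINE.search(line)
        if m:
            pe, ivar, why = (int(g) for g in m.groups())
            found.setdefault(pe, set()).add((ivar, why))
    return {pe: sorted(pairs) for pe, pairs in found.items()}


sample = """refined: total blocks = 76393
[flash_convert_cc_hook] PE= 4093, ivar= 4, why=2
Trying to convert non-zero mass-specific variable to per-volume form, but dens is zero!
"""
print(failing_pes(sample))
```

If more than one key shows up in the result, the zero density is being hit on several processors at once rather than on a single one.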
I've tried enabling "amr_error_checking" to dump some additional
information, but with that option enabled I get segmentation faults
on all processors. Here's the standard output just before the crash:
mpi_amr_1blk_restrict: after commsetup: pe 3
mpi_amr_1blk_restrict: after commsetup: pe 2
mpi_amr_1blk_restrict: pe 3 blk 10 ich = 1
mpi_amr_1blk_restrict: pe 3 blk 10 child = 3 11
mpi_amr_1blk_restrict: pe 3 blk 10 cnodetype = 1
mpi_amr_1blk_restrict: pe 3 blk 10 cempty = 0
mpi_amr_1blk_restrict: pe 2 blk 1 ich = 1
mpi_amr_1blk_restrict: pe 3 blk 10 calling perm
mpi_amr_1blk_restrict: pe 2 blk 1 child = 2 2
mpi_amr_1blk_restrict: pe 2 blk 1 cnodetype = 1
mpi_amr_1blk_restrict: pe 2 blk 1 cempty = 0
mpi_amr_1blk_restrict: pe 2 blk 1 calling perm
mpi_amr_1blk_restrict: pe 1 blk 2 after reset blockgeom
mpi_amr_1blk_restrict: pe 1 blk 2 bef reset amr_restrict_unk_fun
mpi_amr_1blk_restrict: pe 3 blk 10 exited perm
mpi_amr_1blk_restrict: pe 2 blk 1 exited perm
mpi_amr_1blk_restrict: pe 3 blk 10 calling blockgeom
mpi_amr_1blk_restrict: pe 1 blk 2 aft reset amr_restrict_unk_fun
mpi_amr_1blk_restrict: pe 2 blk 1 calling blockgeom
mpi_amr_1blk_restrict: pe 1 blk 2 aft lcc
mpi_amr_1blk_restrict: pe 1 blk 2 ich = 3
mpi_amr_1blk_restrict: pe 1 blk 2 child = 1 5
mpi_amr_1blk_restrict: pe 1 blk 2 cnodetype = 1
mpi_amr_1blk_restrict: pe 1 blk 2 cempty = 0
mpi_amr_1blk_restrict: pe 1 blk 2 calling perm
Application 1231925 exit codes: 139
Application 1231925 exit signals: Killed
Application 1231925 resources: utime 4885, stime 185
Unfortunately, I wasn't able to pin down what I actually did to fix
the problem I had with zero density values a few months ago. I had
been trying many different things, including changing bits of the
actual Simulation code.
Your help is very much appreciated!
--
James Guillochon
Department of Astronomy & Astrophysics
University of California, Santa Cruz
jfg at ucolick.org
On Sep 1, 2009, at 8:26 AM, Klaus Weide wrote:
> On Mon, 31 Aug 2009, James Guillochon wrote:
>
>> Hi all,
>>
>> I'm trying to restart a FLASH simulation on Franklin. If I run on 2000 cpus,
>> the job runs fine, however if I try to push the code to 4000 cpus, I get the
>> following error:
>>
>> abort_message [flash_convert_cc_hook] Trying to convert non-zero mass-specific
>> variable to per-volume form, but dens is zero!
>
> James,
>
> As you have probably noticed, the problem (or at least the symptom) is
> essentially the same as what you reported to flash-users in April:
> unwanted zero values in DENS_VAR. The code in flash_convert_cc_hook,
> which triggers the abort, does essentially the same as the code in
> amr_prolong_gen_unk1_fun, where the problem showed up in your previous
> report. I don't know what was ultimately the cause of the previous
> problem, but you have solved it somehow; could the cause be similar
> this time?
>
> Please remind us whether you are using the latest released version of
> FLASH. Also,
> - Does this problem occur immediately after restart, or some time later?
> - What are the last log file messages before the abort?
> - There should also have been a message on standard output, with
>   additional information (PE, ivar, and value). Do you have that?
> - Does the problem occur on several PEs at the same time (you should
>   then see several of the standard output messages), or only on one
>   CPU?
>
>
> Klaus
>