[FLASH-USERS] 4000+ cpus on Franklin

James Guillochon jfg at ucolick.org
Tue Sep 1 13:31:57 EDT 2009


Hi Klaus,

I am using FLASH 3.0. The problem occurs immediately after restart,  
before the first time step. Here's a copy of the log before aborting:

[ 08-31-2009  21:56:16.578 ] [GRID amr_refine_derefine]: initiating refinement
[GRID amr_refine_derefine] min blks 17    max blks 21    tot blks 76393
[GRID amr_refine_derefine] min leaf blks 13    max leaf blks 17    tot leaf blks 66844
[ 08-31-2009  21:56:16.655 ] [GRID amr_refine_derefine]: refinement complete
[DRIVER_ABORT] Driver_abort() called by PE         4093
abort_message [flash_convert_cc_hook] Trying to convert non-zero mass-specific variable to per-volume form, but dens is zero!

Here's the standard output:

  file: wdacc_hdf5_chk_00170 opened for restart
  read_data:  read         76393  blocks.
  io_readData:  finished reading input file.
  [Eos_init] Cannot open helm_table.bdat!
  [Eos_init] Trying old helm_table.dat!
  Source terms initialized
  don_dist, don_mass    3359870241.058920        6.5677519229596569E+032
  [EOS Helmholtz] WARNING!  Mask setting does not speed up Eos Helmholtz calls
   iteration, no. not moved =             0        76246
   iteration, no. not moved =             1        42904
   iteration, no. not moved =             2            0
  refined: total leaf blocks =         66844
  refined: total blocks =         76393
[flash_convert_cc_hook] PE=   4093, ivar=  4, why=2
  Trying to convert non-zero mass-specific variable to per-volume form, but dens is zero!
Application 1231431 exit codes: 1
Application 1231431 exit signals: Killed
Application 1231431 resources: utime 0, stime 0

The error seems to be happening on one of the last-indexed processors  
(there are 4096 processors total, and the error occurs on PE 4093),  
and only on that one.
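
As far as I understand the abort, flash_convert_cc_hook multiplies the  
mass-specific variables by density to put them into per-volume  
(conserved) form before the grid operations, and it bails out when a  
cell has a non-zero mass-specific value but zero density, since that  
conversion could never be undone. Here is a minimal Python sketch of  
that check as I understand it (not the actual FLASH Fortran source;  
the function name is just illustrative):

import numpy as np

def convert_to_per_volume(var, dens):
    # Mass-specific, cell-centered variables (velocities, specific
    # energies, abundances) are converted to per-volume form by
    # multiplying by density, so that restriction/prolongation act on
    # conserved quantities.
    # A cell with a non-zero mass-specific value but zero density makes
    # the conversion irreversible (the later division by dens is
    # undefined), so the code aborts rather than continuing silently.
    bad = (dens == 0.0) & (var != 0.0)
    if np.any(bad):
        raise RuntimeError("Trying to convert non-zero mass-specific "
                           "variable to per-volume form, but dens is zero!")
    return var * dens

So the abort itself just says that a block with dens == 0 but non-zero  
mass-specific data (ivar = 4 in the message above) ended up on PE  
4093; the real question is how those zero-density cells appear after  
the restart.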

I've tried enabling "amr_error_checking" to dump some additional  
information, but if I enable that option I end up with segmentation  
faults on all processors. Here's the standard output just before the  
crash:

  mpi_amr_1blk_restrict: after commsetup: pe             3
  mpi_amr_1blk_restrict: after commsetup: pe             2
  mpi_amr_1blk_restrict: pe             3  blk            10  ich =
             1
  mpi_amr_1blk_restrict: pe             3  blk            10  child =
             3           11
  mpi_amr_1blk_restrict: pe             3  blk            10  cnodetype =
             1
  mpi_amr_1blk_restrict: pe             3  blk            10  cempty =
             0
  mpi_amr_1blk_restrict: pe             2  blk             1  ich =
             1
  mpi_amr_1blk_restrict: pe             3  blk            10  calling perm
  mpi_amr_1blk_restrict: pe             2  blk             1  child =
             2            2
  mpi_amr_1blk_restrict: pe             2  blk             1  cnodetype =
             1
  mpi_amr_1blk_restrict: pe             2  blk             1  cempty =
             0
  mpi_amr_1blk_restrict: pe             2  blk             1  calling perm
  mpi_amr_1blk_restrict: pe             1  blk             2
   after reset blockgeom
  mpi_amr_1blk_restrict: pe             1  blk             2
   bef reset amr_restrict_unk_fun
  mpi_amr_1blk_restrict: pe             3  blk            10  exited perm
  mpi_amr_1blk_restrict: pe             2  blk             1  exited perm
  mpi_amr_1blk_restrict: pe             3  blk            10  calling blockgeom
  mpi_amr_1blk_restrict: pe             1  blk             2
   aft reset amr_restrict_unk_fun
  mpi_amr_1blk_restrict: pe             2  blk             1  calling blockgeom
  mpi_amr_1blk_restrict: pe             1  blk             2  aft lcc
  mpi_amr_1blk_restrict: pe             1  blk             2  ich =
             3
  mpi_amr_1blk_restrict: pe             1  blk             2  child =
             1            5
  mpi_amr_1blk_restrict: pe             1  blk             2  cnodetype =
             1
  mpi_amr_1blk_restrict: pe             1  blk             2  cempty =
             0
  mpi_amr_1blk_restrict: pe             1  blk             2  calling perm
Application 1231925 exit codes: 139
Application 1231925 exit signals: Killed
Application 1231925 resources: utime 4885, stime 185

Unfortunately, I wasn't able to pin down what I actually did to fix  
the zero-density problem I had a few months ago. I had been trying  
many different things at the time, including changing parts of the  
Simulation code itself.

Your help is very much appreciated!

-- 
James Guillochon
Department of Astronomy & Astrophysics
University of California, Santa Cruz
jfg at ucolick.org

On Sep 1, 2009, at 8:26 AM, Klaus Weide wrote:

> On Mon, 31 Aug 2009, James Guillochon wrote:
>
>> Hi all,
>>
>> I'm trying to restart a FLASH simulation on Franklin. If I run on  
>> 2000 cpus,
>> the job runs fine, however if I try to push the code to 4000 cpus,  
>> I get the
>> following error:
>>
>> abort_message [flash_convert_cc_hook] Trying to convert non-zero  
>> mass-specific
>> variable to per-volume form, but dens is zero!
>
> James,
>
> As you have probably noticed, the problem (or at least the symptom) is
> essentially the same as what you reported to flash-users in April:
> unwanted zero values in DENS_VAR.  The code in flash_convert_cc_hook,
> which triggers the abort, does essentially the same as the code in
> amr_prolong_gen_unk1_fun, where the problem showed up in your previous
> report.  I don't know what was ultimately the cause of the previous
> problem, but you have solved it somehow; could the cause be similar  
> this
> time?
>
> Please remind us whether you are using the latest released version of
> FLASH.  Also,
> - does this problem occur immediately after restart, or some
>   time later?
> - What are the last log file messages before the abort?
> - There should also have been a message on standard output, with
>   additional information (PE, ivar, and value). Do you have that?
> - Does the problem occur on several PEs at the same time (you should
>   then see several of the standard output messages), or only on one  
> CPU?
>
>
> Klaus
>
>
>
