[FLASH-USERS] 4000+ cpus on Franklin

James Guillochon jfg at ucolick.org
Fri Sep 4 20:10:14 EDT 2009


Some more information on this issue:

I compiled FLASH with -debug and I get the following errors:

    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1
    0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
        subscript=-1, lower bound=1, upper bound=54, dimension=1

And so on. The -1 subscript comes from the "to_be_sent" array, which
is initialized with every entry set to -1 and is then populated by MPI
calls in another function. So the error behind my crash seems to be
somewhere in the PM3 mpi_source directory.
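
To make the failure mode described above concrete, here is a minimal sketch in Python. It is an illustration only, not the PARAMESH code: an index array is pre-filled with the sentinel -1 and only partly populated, so any entry left unfilled is not a valid 1-based subscript. The names SENTINEL and find_unset_entries and the sample data are hypothetical; only the value -1 and the bounds 1..54 come from the error output.

```python
SENTINEL = -1  # assumed initial value of every entry, mirroring "to_be_sent"


def find_unset_entries(to_be_sent, lower=1, upper=54):
    """Return the 0-based positions whose value is not a valid subscript.

    The default bounds (1..54) are the ones reported in the error output.
    """
    return [
        pos
        for pos, subscript in enumerate(to_be_sent)
        if not (lower <= subscript <= upper)
    ]


# Hypothetical data: six slots, only two of which were filled in afterwards.
to_be_sent = [SENTINEL] * 6
to_be_sent[0] = 3
to_be_sent[2] = 54

# Positions still holding the sentinel (or any other out-of-range value).
print(find_unset_entries(to_be_sent))
```

A check of this kind, placed just before the failing array access, might help show which entries were never populated, though it would not by itself explain why the populating step skipped them.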

Anyone familiar with that part of the code? I would upgrade to 3.2 to  
see if PM4 fixes the problem, but I am not sure if I can restart from  
my 3.0 checkpoint if I do that.

Thanks all!

-- 
James Guillochon
Department of Astronomy & Astrophysics
University of California, Santa Cruz
jfg at ucolick.org

On Sep 1, 2009, at 8:26 AM, Klaus Weide wrote:

> On Mon, 31 Aug 2009, James Guillochon wrote:
>
>> Hi all,
>>
>> I'm trying to restart a FLASH simulation on Franklin. If I run on 2000 cpus,
>> the job runs fine, however if I try to push the code to 4000 cpus, I get the
>> following error:
>>
>> abort_message [flash_convert_cc_hook] Trying to convert non-zero mass-specific
>> variable to per-volume form, but dens is zero!
>
> James,
>
> As you have probably noticed, the problem (or at least the symptom) is
> essentially the same as what you reported to flash-users in April:
> unwanted zero values in DENS_VAR.  The code in flash_convert_cc_hook,
> which triggers the abort, does essentially the same as the code in
> amr_prolong_gen_unk1_fun, where the problem showed up in your previous
> report.  I don't know what was ultimately the cause of the previous
> problem, but you have solved it somehow; could the cause be similar this
> time?
>
> Please remind us whether you are using the latest released version of
> FLASH.  Also,
> - does this problem occur immediately after restart, or some
>   time later?
> - What are the last log file messages before the abort?
> - There should also have been a message on standard output, with
>   additional information (PE, ivar, and value). Do you have that?
> - Does the problem occur on several PEs at the same time (you should
>   then see several of the standard output messages), or only on one CPU?
>
>
> Klaus
>