[FLASH-USERS] 4000+ cpus on Franklin

John ZuHone jzuhone at cfa.harvard.edu
Fri Sep 4 20:20:09 EDT 2009


James,

My (hopefully educated) guess, based on my experience, is that a restart
with the later version should work, as the data structures stored in
the checkpoint are the same (someone please correct me if not!). If it
does, you may also want to restart from an earlier checkpoint to
make sure FLASH gives the same answers.

Of course this doesn't resolve the bug but would be a nice check.
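
For the comparison itself, something along these lines might do as a rough sketch. It assumes the same variable (say dens) has already been read from the two runs into flat lists in the same block and cell order; the function name, the sample values, and the tolerance are made up for illustration and are not anything in FLASH.

```python
def max_relative_difference(old_values, new_values):
    """Largest relative difference between two equally long lists of floats."""
    if len(old_values) != len(new_values):
        raise ValueError("the two runs have different numbers of values")
    worst = 0.0
    for old, new in zip(old_values, new_values):
        scale = max(abs(old), abs(new))
        if scale > 0.0:  # skip pairs where both values are exactly zero
            worst = max(worst, abs(old - new) / scale)
    return worst


# Hypothetical values standing in for one variable from each run.
dens_old = [1.0, 2.0, 4.0]
dens_new = [1.0, 2.0, 4.0]
tolerance = 1.0e-12  # assumed threshold, adjust to taste
print(max_relative_difference(dens_old, dens_new) <= tolerance)
```

A difference well above round-off would suggest the two versions are not evolving the restart identically.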

Best,

John

On Sep 4, 2009, at 8:10 PM, James Guillochon <jfg at ucolick.org> wrote:

> Some more information on this issue:
>
> I compiled FLASH with -debug and I get the following errors:
>
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
> 0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
>     subscript=-1, lower bound=1, upper bound=54, dimension=1
>
> Etc, etc. The -1 subscript is coming from the "to_be_sent" array,  
> which by default is initialized as an array with all entries = -1,  
> but is populated by MPI calls in another function. So it seems like  
> the error leading to my crash is somewhere in the PM3 mpi_source  
> directory.
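
A minimal sketch of the kind of guard that would catch this before the lookup, written in Python rather than Fortran just to keep it short. The array names come from the debug output above; the function names and any sample values are hypothetical, and this is not the PARAMESH code.

```python
UNSET = -1  # value to_be_sent is said to hold before the MPI calls fill it in


def unset_positions(to_be_sent):
    """Return the 1-based positions that still hold the sentinel."""
    return [pos for pos, value in enumerate(to_be_sent, start=1) if value == UNSET]


def lookup_sizes(loc_message_size, to_be_sent):
    """Index loc_message_size (1-based) with each entry of to_be_sent."""
    bad = unset_positions(to_be_sent)
    if bad:
        raise IndexError(f"to_be_sent was never filled in at positions {bad}")
    upper = len(loc_message_size)
    sizes = []
    for index in to_be_sent:
        # Mirror the bounds check that -debug performs on the Fortran array.
        if not 1 <= index <= upper:
            raise IndexError(f"subscript={index}, lower bound=1, upper bound={upper}")
        sizes.append(loc_message_size[index - 1])
    return sizes
```

Reporting which positions are still -1 at the point of use might help narrow down which exchange failed to populate the array.
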
>
> Anyone familiar with that part of the code? I would upgrade to 3.2  
> to see if PM4 fixes the problem, but I am not sure if I can restart  
> from my 3.0 checkpoint if I do that.
>
> Thanks all!
>
> -- 
> James Guillochon
> Department of Astronomy & Astrophysics
> University of California, Santa Cruz
> jfg at ucolick.org
>
> On Sep 1, 2009, at 8:26 AM, Klaus Weide wrote:
>
>> On Mon, 31 Aug 2009, James Guillochon wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to restart a FLASH simulation on Franklin. If I run on 2000 cpus,
>>> the job runs fine, however if I try to push the code to 4000 cpus, I get the
>>> following error:
>>>
>>> abort_message [flash_convert_cc_hook] Trying to convert non-zero mass-specific
>>> variable to per-volume form, but dens is zero!
>>
>> James,
>>
>> As you have probably noticed, the problem (or at least the symptom) is
>> essentially the same as what you reported to flash-users in April:
>> unwanted zero values in DENS_VAR.  The code in flash_convert_cc_hook,
>> which triggers the abort, does essentially the same as the code in
>> amr_prolong_gen_unk1_fun, where the problem showed up in your previous
>> report.  I don't know what was ultimately the cause of the previous
>> problem, but you have solved it somehow; could the cause be similar this
>> time?
>>
>> Please remind us whether you are using the latest released version of
>> FLASH.  Also,
>> - does this problem occur immediately after restart, or some
>>  time later?
>> - What are the last log file messages before the abort?
>> - There should also have been a message on standard output, with
>>  additional information (PE, ivar, and value). Do you have that?
>> - Does the problem occur on several PEs at the same time (you should
>>  then see several of the standard output messages), or only on one CPU?
>>
>>
>> Klaus
>>


