[FLASH-USERS] 4000+ cpus on Franklin
James Guillochon
jfg at ucolick.org
Fri Sep 4 20:10:14 EDT 2009
Some more information on this issue:
I compiled FLASH with -debug and I get the following errors:
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
0: Subscript out of range for array loc_message_size (mpi_pack_blocks.F90: 146)
    subscript=-1, lower bound=1, upper bound=54, dimension=1
And so on. The -1 subscript comes from the "to_be_sent" array, which is
initialized with all entries = -1 and is later populated by MPI calls in
another function. So the error leading to my crash seems to be somewhere
in the PM3 mpi_source directory. Is anyone familiar with that part of the
code? I would upgrade to 3.2 to see if PM4 fixes the problem, but I am not
sure whether I can restart from my 3.0 checkpoint if I do that.
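
To make the suspected failure mode concrete, here is a minimal sketch in Python. It is not FLASH or PARAMESH source. The two array names are taken from the error output and the paragraph above, but the array contents, the sentinel constant, and both helper functions are hypothetical. It only shows how an entry left at its -1 default trips a bounds check once it is used as a subscript into a 1-based array with upper bound 54.

```python
# Illustrative sketch only; not FLASH or PARAMESH source code.
# Array names echo the error message; contents and helpers are assumptions.

UNSET = -1  # sentinel assumed to mark an entry that was never populated


def checked_lookup(loc_message_size, subscript):
    """Mimic a Fortran-style bounds check on a 1-based array."""
    lower, upper = 1, len(loc_message_size)
    if not lower <= subscript <= upper:
        raise IndexError(
            f"subscript={subscript}, lower bound={lower}, upper bound={upper}"
        )
    return loc_message_size[subscript - 1]


def find_unset_entries(to_be_sent):
    """Return the 1-based positions that still hold the sentinel."""
    return [i for i, value in enumerate(to_be_sent, start=1) if value == UNSET]


# 54 message sizes, matching the upper bound reported in the error output.
loc_message_size = [100] * 54

# Hypothetical state: the second entry was never filled in.
to_be_sent = [3, UNSET, 17]

print("unset positions:", find_unset_entries(to_be_sent))

for value in to_be_sent:
    try:
        checked_lookup(loc_message_size, value)
    except IndexError as err:
        print("out of range:", err)
```

A check like find_unset_entries, placed just before the array is used, would at least show which entries were never populated and on which processor, which might narrow down where the populating step goes wrong.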
Thanks all!
--
James Guillochon
Department of Astronomy & Astrophysics
University of California, Santa Cruz
jfg at ucolick.org
On Sep 1, 2009, at 8:26 AM, Klaus Weide wrote:
> On Mon, 31 Aug 2009, James Guillochon wrote:
>
>> Hi all,
>>
>> I'm trying to restart a FLASH simulation on Franklin. If I run on 2000 cpus,
>> the job runs fine, however if I try to push the code to 4000 cpus, I get the
>> following error:
>>
>> abort_message [flash_convert_cc_hook] Trying to convert non-zero mass-specific
>> variable to per-volume form, but dens is zero!
>
> James,
>
> As you have probably noticed, the problem (or at least the symptom) is
> essentially the same as what you reported to flash-users in April:
> unwanted zero values in DENS_VAR. The code in flash_convert_cc_hook,
> which triggers the abort, does essentially the same as the code in
> amr_prolong_gen_unk1_fun, where the problem showed up in your previous
> report. I don't know what was ultimately the cause of the previous
> problem, but you have solved it somehow; could the cause be similar
> this time?
>
> Please remind us whether you are using the latest released version of
> FLASH. Also,
> - does this problem occur immediately after restart, or some
> time later?
> - What are the last log file messages before the abort?
> - There should also have been a message on standard output, with
> additional information (PE, ivar, and value). Do you have that?
> - Does the problem occur on several PEs at the same time (you should
> then see several of the standard output messages), or only on one
> CPU?
>
>
> Klaus