[FLASH-USERS] 4000+ cpus on Franklin
James Guillochon
jfg at ucolick.org
Mon Sep 7 20:58:36 EDT 2009
OK, I've upgraded to FLASH 3.2. As before, I can run the simulation on
2048 processors, but when I try 4096 I get the following errors:
aborting job:
Fatal error in MPI_Irecv: Invalid tag, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x1373f760, count=378, MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD, request=0x1377aea8) failed
MPI_Irecv(97).: Invalid tag, value is 16777312
aborting job:
Fatal error in MPI_Irecv: Invalid tag, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378, MPI_INTEGER, src=4095, tag=16777315, MPI_COMM_WORLD, request=0x1377af18) failed
MPI_Irecv(97).: Invalid tag, value is 16777315
aborting job:
Fatal error in MPI_Irecv: Invalid tag, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x13788c98, count=378, MPI_INTEGER, src=4095, tag=16777313, MPI_COMM_WORLD, request=0x13741674) failed
MPI_Irecv(97).: Invalid tag, value is 16777313
aborting job:
Fatal error in MPI_Irecv: Invalid tag, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378, MPI_INTEGER, src=4095, tag=16777314, MPI_COMM_WORLD, request=0x1377af18) failed
MPI_Irecv(97).: Invalid tag, value is 16777314
aborting job:
Fatal error in MPI_Ssend: Invalid tag, error stack:
MPI_Ssend(167): MPI_Ssend(buf=0x1354d910, count=378, MPI_INTEGER, dest=4091, tag=16777312, MPI_COMM_WORLD) failed
MPI_Ssend(93).: Invalid tag, value is 16777312
[NID 12795]Apid 1253267: initiated application termination
MPI_TAG_UB is about 2 billion on the machine I'm running on, so the
tag numbers here don't seem to be out of range...
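
For reference, the rejected tags can be compared against powers of two. The short Python sketch below only does arithmetic on the values copied from the error messages. The constant REPORTED_TAG_UB is an assumption standing in for "about 2 billion" (taken here as 2**31 - 1), and the helper name describe is made up for illustration. It shows that every rejected tag sits just above 2**24 while staying far below the reported bound. One possible reading, not confirmed anywhere in this thread, is that the library enforces a smaller effective limit than the MPI_TAG_UB attribute suggests.

```python
# Tag values copied verbatim from the MPI_Irecv / MPI_Ssend errors above.
rejected_tags = [16777312, 16777315, 16777313, 16777314]

TWO_POW_24 = 2 ** 24            # 16777216
# Assumption: "about 2 billion" is taken to mean 2**31 - 1.
REPORTED_TAG_UB = 2 ** 31 - 1


def describe(tag):
    """Return how far a tag lies above 2**24 and whether it is within the reported bound."""
    return {
        "tag": tag,
        "offset_from_2_24": tag - TWO_POW_24,
        "within_reported_ub": tag <= REPORTED_TAG_UB,
    }


if __name__ == "__main__":
    for tag in sorted(rejected_tags):
        info = describe(tag)
        print(
            f"tag={info['tag']}  2**24 + {info['offset_from_2_24']}  "
            f"within reported MPI_TAG_UB: {info['within_reported_ub']}"
        )
```

If the offsets are all small and positive, as they are here, checking how the tag is computed from the processor count would be a reasonable next step.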
I also tried switching the PARAMESH library to Paramesh4dev, but for
some reason it seems unable to find the amr_runtime_parameters file.
I've tried setting "ParameshLibraryMode=true", but that doesn't seem
to help. Here is the error:
PGFIO-F-209/OPEN/unit=35/'OLD' specified for file which does not exist.
File name = amr_runtime_parameters
In source file amr_set_runtime_parameters.F90, at line number 86
[NID 1593]Apid 1255265: initiated application termination
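
The file name in this error is relative, so the OPEN is resolved against the working directory of the running job. A minimal pre-launch check, sketched in Python, can confirm whether the file is actually present there before submitting. The function name missing_inputs and the default list of required files are hypothetical and not part of FLASH or PARAMESH. The sketch says nothing about why the code asks for the file in the first place.

```python
import os


def missing_inputs(run_dir, required=("amr_runtime_parameters",)):
    """Return the names in `required` that are not regular files inside run_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(run_dir, name))
    ]


if __name__ == "__main__":
    # Assumption: the job is launched from the current directory.
    missing = missing_inputs(os.getcwd())
    if missing:
        print("missing from run directory:", ", ".join(missing))
    else:
        print("all required input files found")
```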
Thanks in advance!
--
James Guillochon
Department of Astronomy & Astrophysics
University of California, Santa Cruz
jfg at ucolick.org
On Sep 4, 2009, at 8:57 PM, Klaus Weide wrote:
> On Fri, 4 Sep 2009, James Guillochon wrote:
>
>> Etc, etc. The -1 subscript is coming from the "to_be_sent" array, which by
>> default is initialized as an array with all entries = -1, but is populated by
>> MPI calls in another function. So it seems like the error leading to my crash
>> is somewhere in the PM3 mpi_source directory.
>>
>> Anyone familiar with that part of the code? I would upgrade to 3.2 to see if
>> PM4 fixes the problem, but I am not sure if I can restart from my 3.0
>> checkpoint if I do that.
>
> James,
>
> Yes it is a good idea to upgrade to 3.2. Quite a bit of experience with
> running on large numbers of processors has gone into the code development
> since version 3.0 was released. So the cause of the problem may well have
> been fixed, with either the Paramesh4.0 or the Paramesh4dev variants of
> PARAMESH that come with FLASH3.2.
>
> The format of checkpoint files has changed slightly, but restarting
> with version 3.2 code from 3.0 checkpoint files should work fine.
> (Unless the grid state was somehow already corrupt when the checkpoint
> was written!)
>
> Klaus