[FLASH-USERS] 4000+ cpus on Franklin

James Guillochon jfg at ucolick.org
Mon Sep 7 20:58:36 EDT 2009


OK, I've upgraded to FLASH 3.2. As before, I can run the simulation on
2048 processors, but when I try 4096 I get the following errors:

    aborting job:
    Fatal error in MPI_Irecv: Invalid tag, error stack:
    MPI_Irecv(144): MPI_Irecv(buf=0x1373f760, count=378, MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD, request=0x1377aea8) failed
    MPI_Irecv(97).: Invalid tag, value is 16777312
    aborting job:
    Fatal error in MPI_Irecv: Invalid tag, error stack:
    MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378, MPI_INTEGER, src=4095, tag=16777315, MPI_COMM_WORLD, request=0x1377af18) failed
    MPI_Irecv(97).: Invalid tag, value is 16777315
    aborting job:
    Fatal error in MPI_Irecv: Invalid tag, error stack:
    MPI_Irecv(144): MPI_Irecv(buf=0x13788c98, count=378, MPI_INTEGER, src=4095, tag=16777313, MPI_COMM_WORLD, request=0x13741674) failed
    MPI_Irecv(97).: Invalid tag, value is 16777313
    aborting job:
    Fatal error in MPI_Irecv: Invalid tag, error stack:
    MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378, MPI_INTEGER, src=4095, tag=16777314, MPI_COMM_WORLD, request=0x1377af18) failed
    MPI_Irecv(97).: Invalid tag, value is 16777314
    aborting job:
    Fatal error in MPI_Ssend: Invalid tag, error stack:
    MPI_Ssend(167): MPI_Ssend(buf=0x1354d910, count=378, MPI_INTEGER, dest=4091, tag=16777312, MPI_COMM_WORLD) failed
    MPI_Ssend(93).: Invalid tag, value is 16777312
    [NID 12795]Apid 1253267: initiated application termination

MPI_TAG_UB is about 2 billion on the machine I'm running on, so the  
tag numbers here don't seem to be out of range...
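
As a quick sanity check on the numbers in the log, the rejected tags can be compared against powers of two. The following is a minimal sketch in Python, not part of FLASH or PARAMESH; the helper name `describe_tag` is hypothetical, and reading "about 2 billion" as 2**31 - 1 is an assumption.

```python
# Tag values copied from the MPI error messages above.
failing_tags = [16777312, 16777315, 16777313, 16777314, 16777312]


def describe_tag(tag):
    """Return (number of bits needed, offset above 2**24) for a tag value."""
    return tag.bit_length(), tag - 2**24


for tag in sorted(set(failing_tags)):
    bits, offset = describe_tag(tag)
    print(f"{tag} = 2**24 + {offset}  ({bits} bits)")

# Assumption: "about 2 billion" is taken to mean 2**31 - 1.
assumed_tag_ub = 2**31 - 1
print("all below assumed MPI_TAG_UB:",
      all(tag <= assumed_tag_ub for tag in failing_tags))
```

Every rejected value sits just above 2**24 = 16777216 (offsets 96 through 99), far below the reported MPI_TAG_UB. That pattern may hint at a 24-bit limit being applied somewhere in the MPI stack, but the log alone does not establish this.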

I also tried switching the PARAMESH library to paramesh4dev, but for
some reason it seems unable to find the amr_runtime_parameters file.
I've tried setting "ParameshLibraryMode=true", but that doesn't seem
to help. Here is the error:

    PGFIO-F-209/OPEN/unit=35/'OLD' specified for file which does not exist.
     File name = amr_runtime_parameters
     In source file amr_set_runtime_parameters.F90, at line number 86
    [NID 1593]Apid 1255265: initiated application termination

Thanks in advance!

-- 
James Guillochon
Department of Astronomy & Astrophysics
University of California, Santa Cruz
jfg at ucolick.org

On Sep 4, 2009, at 8:57 PM, Klaus Weide wrote:

> On Fri, 4 Sep 2009, James Guillochon wrote:
>
>> Etc, etc. The -1 subscript is coming from the "to_be_sent" array, which by
>> default is initialized as an array with all entries = -1, but is populated by
>> MPI calls in another function. So it seems like the error leading to my crash
>> is somewhere in the PM3 mpi_source directory.
>>
>> Anyone familiar with that part of the code? I would upgrade to 3.2 to see if
>> PM4 fixes the problem, but I am not sure if I can restart from my 3.0
>> checkpoint if I do that.
>
> James,
>
> Yes it is a good idea to upgrade to 3.2.  Quite a bit of experience with
> running on large numbers of processors has gone into the code development
> since version 3.0 was released.  So the cause of the problem may well have
> been fixed, with either the Paramesh4.0 or the Paramesh4dev variants of
> PARAMESH that come with FLASH3.2.
>
> The format of checkpoint files has changed slightly, but restarting
> with version 3.2 code from 3.0 checkpoint files should work fine.
> (Unless the grid state was somehow already corrupt when the checkpoint
> was written!)
>
> Klaus
>



