[FLASH-USERS] 4000+ cpus on Franklin

Chris Daley cdaley at flash.uchicago.edu
Tue Sep 8 10:41:28 EDT 2009


Hi James,

>       1 aborting job:
>       2 Fatal error in MPI_Irecv: Invalid tag, error stack:
>       3 MPI_Irecv(144): MPI_Irecv(buf=0x1373f760, count=378, 
> MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD, 
> request=0x1377aea8) failed
>       4 MPI_Irecv(97).: Invalid tag, value is 16777312  

It is very likely that the tags are overflowing.  I have experienced
tag overflow errors on Jaguar when using more than 4000 processes (both
Jaguar and Franklin are Cray XT systems).
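
As a quick check on the numbers in the error output, note that the
rejected tags all sit just above 2^24 = 16777216.  The Python sketch
below pulls the tag values out of log lines like the ones quoted and
compares them with a candidate upper bound.  The bound of 2^24 - 1 is
only an assumption for illustration (it is not a documented limit of
this MPI library), and the function names are hypothetical.

```python
import re

# Assumed bound, for illustration only. The real limit is whatever the MPI
# library enforces, which may differ from the MPI_TAG_UB value it reports.
ASSUMED_TAG_LIMIT = 2**24 - 1


def extract_tags(log_text):
    """Return the sorted, distinct values found as 'tag=<integer>' in log_text."""
    return sorted({int(value) for value in re.findall(r"tag=(\d+)", log_text)})


def tags_over_limit(log_text, limit=ASSUMED_TAG_LIMIT):
    """Map each tag larger than `limit` to the amount by which it exceeds it."""
    return {tag: tag - limit for tag in extract_tags(log_text) if tag > limit}


# Fragments copied from the error stack quoted in this thread.
sample = """
MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD,
MPI_INTEGER, src=4095, tag=16777315, MPI_COMM_WORLD,
MPI_INTEGER, dest=4091, tag=16777312, MPI_COMM_WORLD) failed
"""

print(tags_over_limit(sample))
```

Passing a different limit (for example the MPI_TAG_UB value reported
on your machine) shows whether the tags would be in range under that
bound.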

Paramesh4dev will not overflow MPI tags.  Alternatively, you can use
Paramesh4.0 but remove PPDEFINE PM_UNIQUE_MPI_TAGS from
Grid/GridMain/paramesh/paramesh4/Paramesh4.0/Config before setting up
your application.  However, we have sometimes experienced deadlock
situations in this special mode with Paramesh4.0 (we're not sure why
yet).  We've never encountered a deadlock with Paramesh4dev
(which uses this special mode by default), so I recommend
Paramesh4dev.
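
If you do want to try the Paramesh4.0 route, the Config edit can be
scripted.  This is a minimal Python sketch, assuming the directive
appears on its own line as "PPDEFINE PM_UNIQUE_MPI_TAGS"; the function
name remove_ppdefine and the ".orig" backup suffix are hypothetical
choices, not part of FLASH.

```python
from pathlib import Path


def remove_ppdefine(config_path, symbol="PM_UNIQUE_MPI_TAGS"):
    """Delete 'PPDEFINE <symbol>' lines from a Config file.

    Returns the number of lines removed. Assumes the directive sits on its
    own line; any other line is left untouched.
    """
    path = Path(config_path)
    lines = path.read_text().splitlines(keepends=True)
    kept = [ln for ln in lines if ln.split()[:2] != ["PPDEFINE", symbol]]
    removed = len(lines) - len(kept)
    if removed:
        # Keep a copy of the original file so the change is easy to undo.
        path.with_name(path.name + ".orig").write_text("".join(lines))
        path.write_text("".join(kept))
    return removed


# Example call (run it before setting up the application); adjust the path
# to wherever this Config file lives in your tree:
# remove_ppdefine("Grid/GridMain/paramesh/paramesh4/Paramesh4.0/Config")
```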

>
> I also tried switching the paramesh library to paramesh4dev...but for 
> some reason it seems unable to find the amr_runtime_parameters file.

Simply copy amr_runtime_parameters from your object directory to your 
run directory.
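
If you would rather do the copy from a job-preparation script, a
minimal Python sketch follows.  The function name and the directory
arguments are hypothetical; substitute your own object and run
directories.

```python
import shutil
from pathlib import Path


def copy_amr_parameters(object_dir, run_dir, name="amr_runtime_parameters"):
    """Copy `name` from the object directory into the run directory.

    Returns the path of the copy. Raises FileNotFoundError if the file is
    missing from the object directory.
    """
    src = Path(object_dir) / name
    if not src.is_file():
        raise FileNotFoundError(f"{src} does not exist")
    dst_dir = Path(run_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create the run directory if needed
    return Path(shutil.copy2(src, dst_dir / name))


# Example call with placeholder directory names:
# copy_amr_parameters("object", "run_4096")
```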



Regards,
Chris


James Guillochon wrote:
> OK, I've upgraded to FLASH 3.2...again, I can run the simulation on 
> 2048 processors, but when I try 4096 I get the following errors:
>
>       1 aborting job:
>       2 Fatal error in MPI_Irecv: Invalid tag, error stack:
>       3 MPI_Irecv(144): MPI_Irecv(buf=0x1373f760, count=378, 
> MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD, 
> request=0x1377aea8) failed
>       4 MPI_Irecv(97).: Invalid tag, value is 16777312
>       5 aborting job:
>       6 Fatal error in MPI_Irecv: Invalid tag, error stack:
>       7 MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378, 
> MPI_INTEGER, src=4095, tag=16777315, MPI_COMM_WORLD, 
> request=0x1377af18) failed
>       8 MPI_Irecv(97).: Invalid tag, value is 16777315
>       9 aborting job:
>      10 Fatal error in MPI_Irecv: Invalid tag, error stack:
>      11 MPI_Irecv(144): MPI_Irecv(buf=0x13788c98, count=378, 
> MPI_INTEGER, src=4095, tag=16777313, MPI_COMM_WORLD, 
> request=0x13741674) failed
>      12 MPI_Irecv(97).: Invalid tag, value is 16777313
>      13 aborting job:
>      14 Fatal error in MPI_Irecv: Invalid tag, error stack:
>      15 MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378, 
> MPI_INTEGER, src=4095, tag=16777314, MPI_COMM_WORLD, 
> request=0x1377af18) failed
>      16 MPI_Irecv(97).: Invalid tag, value is 16777314
>      17 aborting job:
>      18 Fatal error in MPI_Ssend: Invalid tag, error stack:
>      19 MPI_Ssend(167): MPI_Ssend(buf=0x1354d910, count=378, 
> MPI_INTEGER, dest=4091, tag=16777312, MPI_COMM_WORLD) failed
>      20 MPI_Ssend(93).: Invalid tag, value is 16777312
>      21 [NID 12795]Apid 1253267: initiated application termination
>
> MPI_TAG_UB is about 2 billion on the machine I'm running on, so the 
> tag numbers here don't seem to be out of range...
>
> I also tried switching the paramesh library to paramesh4dev...but for 
> some reason it seems unable to find the amr_runtime_parameters file. 
> I've tried setting "ParameshLibraryMode=true", but that doesn't seem 
> to help. Here is the error:
>
>       1 PGFIO-F-209/OPEN/unit=35/'OLD' specified for file which does 
> not exist.
>       2  File name = amr_runtime_parameters
>       3  In source file amr_set_runtime_parameters.F90, at line number 86
>       4 [NID 1593]Apid 1255265: initiated application termination
>
> Thanks in advance!
>



