[FLASH-USERS] 4000+ cpus on Franklin
Chris Daley
cdaley at flash.uchicago.edu
Tue Sep 8 10:41:28 EDT 2009
Hi James,
> aborting job:
> Fatal error in MPI_Irecv: Invalid tag, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x1373f760, count=378,
> MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD,
> request=0x1377aea8) failed
> MPI_Irecv(97).: Invalid tag, value is 16777312
It is very likely that the tags are overflowing. I have seen tag
overflow errors on Jaguar when using more than 4000 processes (Jaguar
and Franklin are both Cray XT systems).
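As a quick sanity check on the numbers, the rejected tag values in the
quoted error messages all sit just above 2**24 = 16777216. The sketch
below only does that arithmetic. The choice of 24 bits is an assumption
made for illustration; the thread does not establish what tag limit is
actually enforced on Franklin.

```python
# Tag values copied from the MPI error messages quoted in this thread.
failing_tags = [16777312, 16777315, 16777313, 16777314]


def offset_above_power_of_two(tag, bits=24):
    """Return how far `tag` lies above 2**bits (negative if it is below).

    bits=24 is an illustrative assumption, not a documented limit.
    """
    return tag - (1 << bits)


for tag in sorted(set(failing_tags)):
    print(tag, "= 2**24 +", offset_above_power_of_two(tag))
```

If the offsets are small positive numbers, that is at least consistent
with tags crossing a 24-bit boundary once the process count grows.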
Paramesh4dev will not overflow MPI tags. Alternatively, you can use
Paramesh4.0 but remove PPDEFINE PM_UNIQUE_MPI_TAGS from
Grid/GridMain/paramesh/paramesh4/Paramesh4.0/Config before setting up
your application. However, we have sometimes experienced deadlocks
in this special mode (that is, with PM_UNIQUE_MPI_TAGS undefined) with
Paramesh4.0, and we're not sure why yet. We've never encountered a
deadlock with Paramesh4dev (which uses this special mode by default),
so I recommend Paramesh4dev.
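For anyone who prefers to script the Paramesh4.0 workaround, here is a
minimal sketch that removes the PPDEFINE PM_UNIQUE_MPI_TAGS line from a
Config file. The function name is hypothetical, and it assumes the
definition sits alone on its own line; check the file by hand (and keep
a backup) before re-running setup.

```python
from pathlib import Path


def drop_unique_tags_define(config_path):
    """Delete lines that read 'PPDEFINE PM_UNIQUE_MPI_TAGS'.

    Hypothetical helper: assumes the definition is on a line by itself.
    Returns the number of lines removed.
    """
    path = Path(config_path)
    lines = path.read_text().splitlines(keepends=True)
    # Compare whitespace-separated tokens so indentation does not matter.
    kept = [ln for ln in lines
            if ln.split() != ["PPDEFINE", "PM_UNIQUE_MPI_TAGS"]]
    path.write_text("".join(kept))
    return len(lines) - len(kept)
```

It would be called with the path given above, for example
drop_unique_tags_define("Grid/GridMain/paramesh/paramesh4/Paramesh4.0/Config")
from the top of the FLASH source tree.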
>
> I also tried switching the paramesh library to paramesh4dev...but for
> some reason it seems unable to find the amr_runtime_parameters file.
Simply copy amr_runtime_parameters from your object directory to your
run directory.
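The copy step can be sketched as follows. The function name and the
directory arguments are placeholders (the thread does not name the
actual directories); only the file name amr_runtime_parameters comes
from the error message.

```python
import shutil
from pathlib import Path


def copy_amr_parameters(object_dir, run_dir, name="amr_runtime_parameters"):
    """Copy the Paramesh runtime parameter file into the run directory.

    object_dir and run_dir are placeholders for your own paths.
    Returns the path of the copied file.
    """
    src = Path(object_dir) / name
    if not src.is_file():
        raise FileNotFoundError(f"{src} does not exist")
    dest_dir = Path(run_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps and permission bits where possible.
    return Path(shutil.copy2(src, dest_dir / name))
```

The explicit existence check gives a clearer message than the Fortran
OPEN failure if the file is missing from the object directory as well.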
Regards,
Chris
James Guillochon wrote:
> OK, I've upgraded to FLASH 3.2...again, I can run the simulation on
> 2048 processors, but when I try 4096 I get the following errors:
>
> aborting job:
> Fatal error in MPI_Irecv: Invalid tag, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x1373f760, count=378,
> MPI_INTEGER, src=4095, tag=16777312, MPI_COMM_WORLD,
> request=0x1377aea8) failed
> MPI_Irecv(97).: Invalid tag, value is 16777312
> aborting job:
> Fatal error in MPI_Irecv: Invalid tag, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378,
> MPI_INTEGER, src=4095, tag=16777315, MPI_COMM_WORLD,
> request=0x1377af18) failed
> MPI_Irecv(97).: Invalid tag, value is 16777315
> aborting job:
> Fatal error in MPI_Irecv: Invalid tag, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x13788c98, count=378,
> MPI_INTEGER, src=4095, tag=16777313, MPI_COMM_WORLD,
> request=0x13741674) failed
> MPI_Irecv(97).: Invalid tag, value is 16777313
> aborting job:
> Fatal error in MPI_Irecv: Invalid tag, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x1373e040, count=378,
> MPI_INTEGER, src=4095, tag=16777314, MPI_COMM_WORLD,
> request=0x1377af18) failed
> MPI_Irecv(97).: Invalid tag, value is 16777314
> aborting job:
> Fatal error in MPI_Ssend: Invalid tag, error stack:
> MPI_Ssend(167): MPI_Ssend(buf=0x1354d910, count=378,
> MPI_INTEGER, dest=4091, tag=16777312, MPI_COMM_WORLD) failed
> MPI_Ssend(93).: Invalid tag, value is 16777312
> [NID 12795]Apid 1253267: initiated application termination
>
> MPI_TAG_UB is about 2 billion on the machine I'm running on, so the
> tag numbers here don't seem to be out of range...
>
> I also tried switching the paramesh library to paramesh4dev...but for
> some reason it seems unable to find the amr_runtime_parameters file.
> I've tried setting "ParameshLibraryMode=true", but that doesn't seem
> to help. Here is the error:
>
> PGFIO-F-209/OPEN/unit=35/'OLD' specified for file which does
> not exist.
> File name = amr_runtime_parameters
> In source file amr_set_runtime_parameters.F90, at line number 86
> [NID 1593]Apid 1255265: initiated application termination
>
> Thanks in advance!
>