[FLASH-USERS] code crash and MPI_BUFS_PER_PROC

Mateusz Ruszkowski mateuszr at umich.edu
Fri Nov 25 15:58:20 EST 2011



   Hi,

I am having trouble running a job that refuses to proceed past the stage 
of writing the 0th HDF5 checkpoint file. The last line in the log file is:

  [ 11-25-2011  11:55:40.097 ] [IO_writeCheckpoint] open: type=checkpoint name=Test_hdf5_chk_0000

The key parts of the error file are:


-----------------------

Job 73366.pbspl1.nas.nasa.gov started on Fri Nov 25 11:49:53 PST 2011
The job requested the following resources:
     ncpus=4104
     place=scatter:excl
     walltime=24:00:00

PBS set the following environment variables:
         FORT_BUFFERED = 1
                    TZ = PST8PDT

On r148i1n4:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Current directory is 
/nobackupp1/mruszkow/Test
MPI WARNING: Could not allocate an internal buffer in the last 30 seconds.
Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.

...
... warning repeated many times
...

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libc.so.6          00002AAAAAF52B8B  Unknown               Unknown  Unknown
libmpi.so          00002AAAAB2FBC69  Unknown               Unknown  Unknown
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: aborting job

-------------------------


The job was run in debug mode. The compiler options were:

  f compiler flags:
  ifort -c -g -r8 -i4 -check bounds -check format -check output_conversion -warn all
  -real_size 64 -traceback -mcmodel=medium -DMAXBLOCKS=200 -DNXB=16 -DNYB=16 -DNZB=8 -DN_DIM=3

  c compiler flags:
  icc -I /nasa/hdf5/1.8.0/parallel/include -DH5_USE_16_API -I/nasa/sgi/mpt/2.01/include
  -c -g -debug extended -D_LARGEFILE64_SOURCE -DMAXBLOCKS=200 -DNXB=16 -DNYB=16 -DNZB=8 -DN_DIM=3

The I/O was serial. I also tried a setup with +parallelio, but it 
did not work either.


I set the environment variables to values higher than the SGI MPT defaults:

%env $MPI_BUFS_PER_PROC
env: 256: No such file or directory
%env $MPI_BUFS_PER_HOST
env: 256: No such file or directory
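
The "env: 256: No such file or directory" messages appear because the shell expands $MPI_BUFS_PER_PROC to 256 before env runs, so env tries to execute a program named 256. The output therefore does show that both variables hold 256 in this shell. A minimal sketch of a cleaner check, assuming a POSIX-style shell (in csh/tcsh, "setenv NAME value" would take the place of export):

```shell
# Values for illustration only; 256 matches what the env output above implies.
export MPI_BUFS_PER_PROC=256
export MPI_BUFS_PER_HOST=256

# printenv prints the value of the named variable, or nothing if it is unset.
printenv MPI_BUFS_PER_PROC
printenv MPI_BUFS_PER_HOST

# By contrast, `env $MPI_BUFS_PER_PROC` becomes `env 256`,
# which asks env to run a command called "256" and fails.
```

Whether values set in an interactive shell reach the MPI ranks depends on how the job is submitted; running the same printenv lines inside the job script would confirm it for this setup.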

Even so, the warning message related to these variables did not go away 
("MPI WARNING: Could not allocate an internal buffer in the last 30 seconds.
Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST."), and the code 
crashed.

By the way, the same setup works fine at lower resolution (with one fewer 
level of refinement, and on 1024 rather than 4096 processors).


Does anybody have an idea what the cause of this error may be? It seems that 
the problem is memory-related.
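
For reference, a rough per-process estimate of the block storage implied by the compile flags above. This is only a sketch: the guard-cell count NGUARD=4 is an assumption not stated in the flags, and the number of solution variables is left out because it is not given here.

```shell
# Block dimensions and block limit, taken from the -D flags above.
NXB=16; NYB=16; NZB=8; MAXBLOCKS=200

NGUARD=4          # ASSUMPTION: guard cells per side, not given in the flags
BYTES_PER_REAL=8  # from -r8 / -real_size 64

# Cells in one block, including guard cells on both sides of each dimension.
cells_per_block=$(( (NXB + 2*NGUARD) * (NYB + 2*NGUARD) * (NZB + 2*NGUARD) ))

# Bytes needed to store ONE variable on all MAXBLOCKS blocks of one process.
bytes_per_var=$(( cells_per_block * MAXBLOCKS * BYTES_PER_REAL ))

echo "cells per block (with guard cells): $cells_per_block"
echo "bytes per variable per process:     $bytes_per_var"
```

Under these assumptions the result is 9216 cells per block and 14745600 bytes (about 14 MiB) per variable per process; multiplying by the number of solution variables in the setup gives the size of the main solution array.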


    Thanks,
      Mateusz






