[FLASH-USERS] code crash and MPI_BUFS_PER_PROC
Mateusz Ruszkowski
mateuszr at umich.edu
Fri Nov 25 15:58:20 EST 2011
Hi,
I am having trouble running a job that refuses to proceed past the stage
of writing the 0th HDF5 checkpoint file. The last line in the log file is:
[ 11-25-2011 11:55:40.097 ] [IO_writeCheckpoint] open: type=checkpoint name=Test_hdf5_chk_0000
The key parts of the error file are:
-----------------------
Job 73366.pbspl1.nas.nasa.gov started on Fri Nov 25 11:49:53 PST 2011
The job requested the following resources:
ncpus=4104
place=scatter:excl
walltime=24:00:00
PBS set the following environment variables:
FORT_BUFFERED = 1
TZ = PST8PDT
On r148i1n4:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Current directory is
/nobackupp1/mruszkow/Test
MPI WARNING: Could not allocate an internal buffer in the last 30 seconds.
Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.
...
... warning repeated many times
...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image      PC                Routine  Line     Source
libc.so.6  00002AAAAAF52B8B  Unknown  Unknown  Unknown
libmpi.so  00002AAAAB2FBC69  Unknown  Unknown  Unknown
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: aborting job
-------------------------
The job was run in debug mode. Compiler options were:
f compiler flags:
ifort -c -g -r8 -i4 -check bounds -check format -check output_conversion -warn all
-real_size 64 -traceback -mcmodel=medium -DMAXBLOCKS=200 -DNXB=16 -DNYB=16 -DNZB=8 -DN_DIM=3
c compiler flags:
icc -I /nasa/hdf5/1.8.0/parallel/include -DH5_USE_16_API -I/nasa/sgi/mpt/2.01/include
-c -g -debug extended -D_LARGEFILE64_SOURCE -DMAXBLOCKS=200 -DNXB=16 -DNYB=16 -DNZB=8 -DN_DIM=3
The IO was serial. I also tried a setup with +parallelio, but it
did not work either.
I set the environment variables to values higher than the SGI MPT defaults:
%env $MPI_BUFS_PER_PROC
env: 256: No such file or directory
%env $MPI_BUFS_PER_HOST
env: 256: No such file or directory
and yet the warning message related to these variables did not go away
("MPI WARNING: Could not allocate an internal buffer in the last 30 seconds.
Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.") and the code
still crashed.
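For reference, the "env: 256: No such file or directory" output above only means that each variable expanded to 256, which env then tried to run as a command. Below is a minimal sketch of how the values could be raised and checked. It assumes a POSIX sh job script (under csh/tcsh, `setenv NAME value` would replace `export`), and the value 512 is purely illustrative, not a recommendation.

```shell
#!/bin/sh
# Hypothetical values for illustration only; 512 is not a tuned setting.
# Assumption: the variables must be exported in the job script itself so
# that the MPI launcher started from it inherits them.
export MPI_BUFS_PER_PROC=512
export MPI_BUFS_PER_HOST=512

# printenv prints the value of a variable. Unlike `env $VAR`, it does not
# try to execute the value as a command.
for name in MPI_BUFS_PER_PROC MPI_BUFS_PER_HOST; do
    value=$(printenv "$name")
    echo "$name=$value"
done
```

Printing the values from inside the job script, just before the MPI launch line, would confirm that the running job actually sees the settings.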
By the way, the same setup works fine at lower resolution (one fewer level
of refinement, on 1024 rather than 4096 processors).
Does anybody have an idea what the cause of this error may be? It seems that
the problem is memory related.
Thanks,
Mateusz