<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Dear all,<br>
<br>
I have a problem restarting a simulation and hope that you can help me.<br>
<br>
I am running FLASH for the first time on a BlueGene supercomputer
with 1024 cores and encountered a problem when restarting. My FLASH
version is 4-beta. The simulation ran fine and created 2
checkpoint files. I wanted to restart from the second checkpoint
file, which has a size of about 15 GB (15424362124 bytes). At the
moment the file is read, the simulation just stops. <br>
<br>
The last lines of the FLASH log file are the following:<br>
[ 01-11-2014 19:24:23.616 ] message: vsize (MB): 202.06
(min) 202.12 (max) 202.06 (avg)<br>
[ 01-11-2014 19:24:23.619 ] message: rss (MB): 1.67
(min) 1.67 (max) 1.67 (avg)<br>
[ 01-11-2014 19:24:23.628 ] [io_readData] file opened:
type=checkpoint name=xSHx_hdf5_chk_0002<br>
<br>
I get a lot of core dump files, one per core (see attachment), and
the following error output on the terminal:<br>
NumPartProps: 18<br>
NumPartProps: 18<br>
2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0]
:749776:ibm.runjob.client.Job: terminated by signal 5<br>
2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0]
:749776:ibm.runjob.client.Job: abnormal termination by signal 5 from
rank 208<br>
<br>
The supercomputer is at CINECA in Italy and has 16-core
nodes with 16 GB RAM per node. My BGsize is 64 nodes with 16
ranks per node, meaning that I have 1 GB per core and 1024 cores in
total with 1 TB RAM. <br>
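<br>
As a rough sanity check of these numbers, the following minimal Python
sketch recomputes the job geometry from the values quoted in this
message. The per-rank share of the checkpoint file assumes that the data
are spread evenly across ranks, which is only an assumption and not
something I have verified.<br>
<pre>
```python
# Rough sanity check of the job geometry and checkpoint size quoted above.
# All input values come from this message; the even per-rank split of the
# checkpoint file is an assumption made only for illustration.
nodes = 64
ranks_per_node = 16
ram_per_node_gb = 16
checkpoint_bytes = 15424362124

ranks = nodes * ranks_per_node                      # total MPI ranks
ram_per_rank_gb = ram_per_node_gb / ranks_per_node  # memory available per rank
total_ram_gb = nodes * ram_per_node_gb              # total memory of the partition
checkpoint_mib_per_rank = checkpoint_bytes / ranks / 2**20

print("ranks:", ranks)
print("RAM per rank (GB):", ram_per_rank_gb)
print("total RAM (GB):", total_ram_gb)
print("checkpoint share per rank (MiB):", round(checkpoint_mib_per_rank, 1))
```
</pre>
Under that assumption, each rank would read only about 14 MiB of the
checkpoint, far below the 1 GB available per rank.<br>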
<br>
The code is compiled with the XLF compilers, i.e., mpixlf90 (not
mpixlf90_r, whose purpose I do not know) and mpixlc(xx). There
were several ".f" files that would not compile, so I worked around this by
compiling them separately with mpixlf77. These files were
"fftsg2d.f", "fftsg3d.f", and "umap.F". I don't know how critical
that is, but the compilation succeeds.<br>
<br>
Compiling the code in debug mode, i.e., with "<b>-g -qfullpath -O0
-qcheck</b>" instead of the normal "<b>-O3</b> -qintsize=4
-qrealsize=8 -c -qxlf90=autodealloc -qsuffix=cpp=F -qtune=auto
-qstrict -qarch=auto -qextname -qzerosize", produced the following
warnings (W) and severe errors (S), and the compilation failed:<br>
"io_writeData.F90", line 242.10: 1511-013 (W) The value of the
DO-loop increment should be negative when initial value is greater
than the terminal value.<br>
"io_writeData.F90", line 308.24: 1511-013 (W) The value of the
DO-loop increment should be negative when initial value is greater
than the terminal value.<br>
"io_writeData.F90", line 316.21: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 469.22: 1511-013 (W) The value of the
DO-loop increment should be negative when initial value is greater
than the terminal value.<br>
"io_writeData.F90", line 480.29: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 481.29: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 482.29: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 483.29: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 491.34: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 493.37: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 494.37: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 506.29: 1516-152 (S) Zero-sized arrays must
not be subscripted.<br>
"io_writeData.F90", line 578.24: 1511-013 (W) The value of the
DO-loop increment should be negative when initial value is greater
than the terminal value.<br>
"io_writeData.F90", line 744.10: 1511-013 (W) The value of the
DO-loop increment should be negative when initial value is greater
than the terminal value.<br>
** io_writedata === End of Compilation 1 ===<br>
1501-511 Compilation failed for file io_writeData.F90.<br>
make: *** [io_writeData.o] Error 1<br>
<br>
The same messages also appear for "IO_init.F90":<br>
"IO_init.F90", line 157.13: 1511-013 (W) The value of the DO-loop
increment should be negative when initial value is greater than the
terminal value.<br>
"IO_init.F90", line 182.13: 1511-013 (W) The value of the DO-loop
increment should be negative when initial value is greater than the
terminal value.<br>
"IO_init.F90", line 249.11: 1511-013 (W) The value of the DO-loop
increment should be negative when initial value is greater than the
terminal value.<br>
"IO_init.F90", line 250.13: 1516-152 (S) Zero-sized arrays must not
be subscripted.<br>
"IO_init.F90", line 251.13: 1516-152 (S) Zero-sized arrays must not
be subscripted.<br>
"IO_init.F90", line 252.13: 1516-152 (S) Zero-sized arrays must not
be subscripted.<br>
"IO_init.F90", line 278.8: 1511-013 (W) The value of the DO-loop
increment should be negative when initial value is greater than the
terminal value.<br>
"IO_init.F90", line 279.37: 1516-152 (S) Zero-sized arrays must not
be subscripted.<br>
** io_init === End of Compilation 1 ===<br>
1501-511 Compilation failed for file IO_init.F90.<br>
<br>
There were no errors without the -qcheck option, so I continued
without -qcheck. The same crash happens with the remaining debug
options, without any extra log information. These error/warning lines
might be related to SCRATCH_GRID_VARS_* having end=0 and begin=1
values. I do not make use of the scratch array, so I assumed this is
not critical.<br>
<br>
Any help is appreciated.<br>
Best, <br>
Seyit<br>
<br>
<br>
<br>
<br>
</body>
</html>