[FLASH-USERS] Restarting a Flash simulation on BlueGene

Sean Couch smc at flash.uchicago.edu
Wed Jan 15 08:53:39 EST 2014


Hi Seyit,

Are you using parallel IO?  What is your setup line?  You might try adding, separately, ‘+parallelIO’ and ‘+hdf5typeIO’ to your setup line and trying again.

Sean

--------------------------------------------------------
Sean M. Couch
Hubble Fellow
Flash Center for Computational Science
Department of Astronomy & Astrophysics
The University of Chicago
5747 S Ellis Ave, Jo 315
Chicago, IL  60637
(773) 702-3899
www.flash.uchicago.edu/~smc




On Jan 15, 2014, at 4:58 AM, Seyit Hocuk <seyit at astro.rug.nl> wrote:

> Dear all,
> 
> I have a restarting problem and hope that you can help me.
> 
> I am, for the first time, running Flash on a BlueGene supercomputer with 1024 cores and encountered a problem when restarting. My Flash version is 4-beta. The simulation ran fine and created 2 checkpoint files. I wanted to restart from the second checkpoint file, which has a file size of 15 GB (15424362124 bytes), and the moment the file is read, the simulation just stops.
> 
> The last lines of the Flash log file are the following:
>  [ 01-11-2014  19:24:23.616 ] message: vsize (MB):       202.06 (min)        202.12 (max)        202.06 (avg)
>  [ 01-11-2014  19:24:23.619 ] message: rss   (MB):         1.67 (min)          1.67 (max)          1.67 (avg)
>  [ 01-11-2014  19:24:23.628 ] [io_readData] file opened: type=checkpoint name=xSHx_hdf5_chk_0002
> 
> I get a lot of core file dumps, one per core (see attachment), and the following error output in the terminal:
> NumPartProps:  18
> NumPartProps:  18
> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0] :749776:ibm.runjob.client.Job: terminated by signal 5
> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0] :749776:ibm.runjob.client.Job: abnormal termination by signal 5 from rank 208
> 
> The supercomputer is at CINECA in Italy and has 16-core nodes with 16 GB RAM per node. My BGsize is 64 nodes with 16 ranks per node, meaning that I have 1 GB per core and 1024 cores in total with 1 TB RAM.
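The resource figures in the paragraph above can be checked with a few lines of arithmetic. This is only a rough sketch: the per-rank checkpoint share assumes the file is spread evenly over all ranks when it is read back, which the message does not state, and the variable names are made up for illustration.

```python
# Figures quoted in the message above.
NODES = 64
RANKS_PER_NODE = 16
RAM_PER_NODE_GB = 16
CHECKPOINT_BYTES = 15_424_362_124

total_ranks = NODES * RANKS_PER_NODE                 # MPI ranks in the job
ram_per_rank_gb = RAM_PER_NODE_GB / RANKS_PER_NODE   # memory available to each rank
total_ram_gb = NODES * RAM_PER_NODE_GB               # memory across the whole partition

# Assumption (not from the message): the checkpoint is divided evenly
# among all ranks on restart.
checkpoint_per_rank_mib = CHECKPOINT_BYTES / total_ranks / 2**20

print(f"ranks: {total_ranks}")
print(f"RAM per rank: {ram_per_rank_gb} GB, total: {total_ram_gb} GB")
print(f"checkpoint share per rank: {checkpoint_per_rank_mib:.1f} MiB")
```

Under that even-split assumption, each rank's share of the checkpoint is small compared with its 1 GB of memory, so the file size alone would not obviously explain the crash.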
> 
> The code is compiled with XLF compilers, i.e., mpixlf90 (not the mpixlf90_r, of which I do not know the use) and mpixlc(xx). There were several ".f" files that would not compile, so I solved it by compiling them separately with mpixlf77. These files were "fftsg2d.f", "fftsg3d.f", and "umap.F". I don't know how critical that is, but the compilation is successful.
> 
> Compiling the code in debug mode, i.e., with "-g -qfullpath -O0 -qcheck" instead of the normal "-O3 -qintsize=4 -qrealsize=8 -c -qxlf90=autodealloc -qsuffix=cpp=F -qtune=auto -qstrict -qarch=auto -qextname -qzerosize", produced some warnings and errors:
> "io_writeData.F90", line 242.10: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "io_writeData.F90", line 308.24: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "io_writeData.F90", line 316.21: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 469.22: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "io_writeData.F90", line 480.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 481.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 482.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 483.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 491.34: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 493.37: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 494.37: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 506.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "io_writeData.F90", line 578.24: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "io_writeData.F90", line 744.10: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> ** io_writedata   === End of Compilation 1 ===
> 1501-511  Compilation failed for file io_writeData.F90.
> make: *** [io_writeData.o] Error 1
> 
> The errors also appear in "IO_init.F90":
> "IO_init.F90", line 157.13: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "IO_init.F90", line 182.13: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "IO_init.F90", line 249.11: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "IO_init.F90", line 250.13: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "IO_init.F90", line 251.13: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "IO_init.F90", line 252.13: 1516-152 (S) Zero-sized arrays must not be subscripted.
> "IO_init.F90", line 278.8: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
> "IO_init.F90", line 279.37: 1516-152 (S) Zero-sized arrays must not be subscripted.
> ** io_init   === End of Compilation 1 ===
> 1501-511  Compilation failed for file IO_init.F90.
> 
> There were no errors without the -qcheck option, so I continued without it. The same crash happens with these compiler options, without any extra log information. These error/warning lines might be related to SCRATCH_GRID_VARS_* having end=0 and begin=1 values. I do not make use of the scratch array, so I assumed this is not critical.
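For the begin=1, end=0 case mentioned above, the Fortran rule is that a DO loop runs max(int((stop - start + step) / step), 0) times, so a loop from 1 to 0 with the default increment never executes its body. The sketch below mimics that rule in Python; `fortran_do_indices` is a hypothetical helper written for illustration, and whether the flagged FLASH loops really have these bounds is only the guess made in the message.

```python
def fortran_do_indices(start, stop, step=1):
    """Return the indices visited by a Fortran `do i = start, stop, step` loop."""
    if step == 0:
        raise ValueError("a Fortran DO-loop increment must be nonzero")
    # Fortran trip count: max(int((stop - start + step) / step), 0).
    # Floor division gives the same result here once negatives are clamped to 0.
    trip_count = max((stop - start + step) // step, 0)
    return [start + k * step for k in range(trip_count)]


# begin=1, end=0 with the default increment: the body is never entered.
print(fortran_do_indices(1, 0))
# An ordinary ascending loop and a descending loop, for comparison.
print(fortran_do_indices(1, 3))
print(fortran_do_indices(3, 1, -1))
```

If the loops do have those bounds, the array subscripts inside them are never evaluated at run time, which would be consistent with the code behaving normally when built without -qcheck.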
> 
> Any help is appreciated.
> Best, 
> Seyit
> 
> 
> 
> 
> <core.602><seyit.vcf>
