[FLASH-USERS] Restarting a Flash simulation on BlueGene

Sean Couch smc at flash.uchicago.edu
Wed Jan 15 09:15:52 EST 2014


Seyit,

So long as your HDF5 library has been compiled with parallel support, you don't need any extra libraries.  In my experience on BG systems (both P and Q), parallel IO is absolutely necessary.  Keep in mind also that BG systems have low processor clock rates, so your simulations might seem "slow" relative to other clusters; the trade-off is incredibly fast communication on BG.  In other words, use more cores if your sims are too slow!  I find FLASH strong- (and weak-) scales extremely well on BG.
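
For example, taking the setup line you quote below, adding the shortcut would look something like this (an untested sketch; everything else stays as you have it):

    ./setup --with-library=mpi --with-unit=IO --unit=Grid --gridinterpolation=monotonic SH-dust --auto --portable --3d --maxblocks=200 --objdir=ss-dust-3 +parallelIO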

Sean



--------------------------------------------------------
Sean M. Couch
Hubble Fellow
Flash Center for Computational Science
Department of Astronomy & Astrophysics
The University of Chicago
5747 S Ellis Ave, Jo 315
Chicago, IL  60637
(773) 702-3899
www.flash.uchicago.edu/~smc




On Jan 15, 2014, at 8:12 AM, Seyit Hocuk <seyit at astro.rug.nl> wrote:

> Hi Sean,
> 
> Thanks for your response. 
> No, I do not have parallel IO. So just adding this flag would help? Are any libraries needed? I will surely try it in that case.
> 
> Expanded command line:
> --with-library=mpi --with-unit=IO --unit=Grid --gridinterpolation=monotonic SH-dust --auto --portable --3d --maxblocks=200 --objdir=ss-dust-3
> 
> I use serial HDF5, by the way. In my experience I do not lose much time writing a checkpoint file, so I did not think parallel HDF5 was necessary. However, I am quite disappointed in the overall simulation speed. Could parallel IO also boost the simulation speed? A lot of information is passed between the (1024) processors for the regular calculations, and I think most of the time is lost there.
> 
> Kind regards,
> Seyit
> 
> 
> 
> On 01/15/2014 02:53 PM, Sean Couch wrote:
>> Hi Seyit,
>> 
>> Are you using parallel IO?  What is your setup line?  You might add '+parallelIO' and '+hdf5typeIO' (separately) to your setup line and try again.
>> 
>> Sean
>> 
>> --------------------------------------------------------
>> Sean M. Couch
>> Hubble Fellow
>> Flash Center for Computational Science
>> Department of Astronomy & Astrophysics
>> The University of Chicago
>> 5747 S Ellis Ave, Jo 315
>> Chicago, IL  60637
>> (773) 702-3899
>> www.flash.uchicago.edu/~smc
>> 
>> 
>> 
>> 
>> On Jan 15, 2014, at 4:58 AM, Seyit Hocuk <seyit at astro.rug.nl> wrote:
>> 
>>> Dear all,
>>> 
>>> I have a restarting problem and hope that you can help me.
>>> 
>>> I am running FLASH for the first time on a BlueGene supercomputer with 1024 cores and encountered a problem when restarting. My FLASH version is 4-beta. The simulation ran fine and created 2 checkpoint files. I wanted to restart from the second checkpoint file, which has a file size of 15 GB (15424362124 bytes), but the moment the file is read, the simulation just stops.
>>> 
>>> The last lines of the FLASH log file are the following:
>>>  [ 01-11-2014  19:24:23.616 ] message: vsize (MB):       202.06 (min)        202.12 (max)        202.06 (avg)
>>>  [ 01-11-2014  19:24:23.619 ] message: rss   (MB):         1.67 (min)          1.67 (max)          1.67 (avg)
>>>  [ 01-11-2014  19:24:23.628 ] [io_readData] file opened: type=checkpoint name=xSHx_hdf5_chk_0002
>>> 
>>> I get a lot of core file dumps, one per core (see attachment), and terminal error output like the following:
>>> NumPartProps:  18
>>> NumPartProps:  18
>>> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0] :749776:ibm.runjob.client.Job: terminated by signal 5
>>> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0] :749776:ibm.runjob.client.Job: abnormal termination by signal 5 from rank 208
>>> 
>>> The supercomputer is at the Italian center CINECA and has 16-core nodes with 16 GB of RAM per node. My BGsize is 64 nodes with 16 ranks per node, meaning I have 1 GB per core and 1024 cores in total, with 1 TB of RAM.
>>> 
>>> The code is compiled with the XL compilers, i.e., mpixlf90 (not mpixlf90_r, whose purpose I do not know) and mpixlc(xx). Several ".f" files would not compile, so I compiled them separately with mpixlf77; these files were "fftsg2d.f", "fftsg3d.f", and "umap.F". I don't know how critical that is, but the compilation succeeds.
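>>> 
>>> For reference, the separate compilation looked roughly like this (illustrative commands; I reused the relevant flags from the FLASH Makefile):
>>> 
>>>    mpixlf77 -O3 -qextname -c fftsg2d.f
>>>    mpixlf77 -O3 -qextname -c fftsg3d.f
>>>    mpixlf77 -O3 -qextname -c umap.F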
>>> 
>>> Compiling the code in debug mode, i.e., with "-g -qfullpath -O0 -qcheck" instead of the normal "-O3 -qintsize=4 -qrealsize=8 -c -qxlf90=autodealloc -qsuffix=cpp=F -qtune=auto -qstrict -qarch=auto -qextname -qzerosize", produced the following warnings and severe errors:
>>> "io_writeData.F90", line 242.10: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "io_writeData.F90", line 308.24: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "io_writeData.F90", line 316.21: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 469.22: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "io_writeData.F90", line 480.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 481.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 482.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 483.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 491.34: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 493.37: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 494.37: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 506.29: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "io_writeData.F90", line 578.24: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "io_writeData.F90", line 744.10: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> ** io_writedata   === End of Compilation 1 ===
>>> 1501-511  Compilation failed for file io_writeData.F90.
>>> make: *** [io_writeData.o] Error 1
>>> 
>>> The same errors also appear in "IO_init.F90":
>>> "IO_init.F90", line 157.13: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "IO_init.F90", line 182.13: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "IO_init.F90", line 249.11: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "IO_init.F90", line 250.13: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "IO_init.F90", line 251.13: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "IO_init.F90", line 252.13: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> "IO_init.F90", line 278.8: 1511-013 (W) The value of the DO-loop increment should be negative when initial value is greater than the terminal value.
>>> "IO_init.F90", line 279.37: 1516-152 (S) Zero-sized arrays must not be subscripted.
>>> ** io_init   === End of Compilation 1 ===
>>> 1501-511  Compilation failed for file IO_init.F90.
>>> 
>>> There were no errors without the -qcheck option, so I continued without it. The same crash happens with these compiler options, without any extra log information. These error/warning lines might be related to the SCRATCH_GRID_VARS_* variables having end=0 and begin=1 values. I do not make use of the scratch array, so I assumed this is not critical.
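>>> 
>>> To illustrate (a contrived sketch, not the actual FLASH code): with begin=1 and end=0 the loop body never executes at run time, yet -qcheck rejects the subscript at compile time:
>>> 
>>>    program zerosize
>>>      real    :: scratch(1:0)   ! zero-sized array (begin=1, end=0)
>>>      integer :: i
>>>      do i = 1, 0               ! zero-trip loop: warning 1511-013 (W)
>>>         scratch(i) = 0.0       ! zero-sized array subscripted: 1516-152 (S)
>>>      end do
>>>    end program zerosize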
>>> 
>>> Any help is appreciated.
>>> Best, 
>>> Seyit
>>> 
>>> 
>>> 
>>> 
>> 
> 
