[FLASH-USERS] Restarting a Flash simulation on BlueGene
Seyit Hocuk
seyit at astro.rug.nl
Wed Jan 15 09:12:17 EST 2014
Hi Sean,
Thanks for your response.
No, I do not use parallel IO. So would simply adding this flag
help? Are any libraries needed? I will certainly try it in that case.
Expanded Command line:
--with-library=mpi --with-unit=IO --unit=Grid
--gridinterpolation=monotonic SH-dust --auto --portable --3d
--maxblocks=200 --objdir=ss-dust-3
I use serial hdf5, by the way. In my experience I do not lose
much time writing a checkpoint file, so I did not think parallel hdf5
was necessary. However, I am quite disappointed in the overall
simulation speed. Could parallel IO also boost simulation speed? A lot of
information is passed between the (1024) processors for the regular
calculations, and I think most of the time is lost there.
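
As a rough sanity check on the memory numbers from my original message quoted below (64 nodes, 16 ranks per node, 16 GB per node, a 15424362124-byte checkpoint), here is a small Python sketch. It only does the arithmetic: the function name is made up for illustration, treating "GB" as GiB is an assumption, and it says nothing about how Flash actually distributes the data when it reads the file.

```python
# Figures taken from the original message; GB is assumed to mean GiB here.
CHECKPOINT_BYTES = 15_424_362_124   # size of xSHx_hdf5_chk_0002
NODES = 64
RANKS_PER_NODE = 16
RAM_PER_NODE_GIB = 16

GIB = 1024 ** 3
MIB = 1024 ** 2


def memory_budget(checkpoint_bytes, nodes, ranks_per_node, ram_per_node_gib):
    """Hypothetical helper: compare a checkpoint size with per-rank memory."""
    ranks = nodes * ranks_per_node
    ram_per_rank_bytes = ram_per_node_gib * GIB / ranks_per_node
    return {
        "ranks": ranks,
        "ram_per_rank_gib": ram_per_rank_bytes / GIB,
        "checkpoint_gib": checkpoint_bytes / GIB,
        # Share of the file per rank if it were split perfectly evenly.
        "even_share_per_rank_mib": checkpoint_bytes / ranks / MIB,
        # Would the entire file fit in the memory of a single rank?
        "fits_on_one_rank": checkpoint_bytes <= ram_per_rank_bytes,
    }


budget = memory_budget(CHECKPOINT_BYTES, NODES, RANKS_PER_NODE, RAM_PER_NODE_GIB)
for key, value in budget.items():
    print(f"{key}: {value}")
```

The only point is that the whole file is far larger than the memory of a single rank, while an even split over all ranks is small compared to the 1 GB each rank has.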
Kind regards,
Seyit
On 01/15/2014 02:53 PM, Sean Couch wrote:
> Hi Seyit,
>
> Are you using parallel IO? What is your setup line? You might try
> adding, separately, ‘+parallelIO’ and ‘+hdf5typeIO’ to your setup line
> and trying again.
>
> Sean
>
> --------------------------------------------------------
> Sean M. Couch
> Hubble Fellow
> Flash Center for Computational Science
> Department of Astronomy & Astrophysics
> The University of Chicago
> 5747 S Ellis Ave, Jo 315
> Chicago, IL 60637
> (773) 702-3899
> www.flash.uchicago.edu/~smc <http://www.flash.uchicago.edu/%7Esmc>
>
>
>
>
> On Jan 15, 2014, at 4:58 AM, Seyit Hocuk <seyit at astro.rug.nl
> <mailto:seyit at astro.rug.nl>> wrote:
>
>> Dear all,
>>
>> I have a restarting problem and hope that you can help me.
>>
>> I am, for the first time, running Flash on a BlueGene supercomputer
>> with 1024 cores and encountered a problem when restarting. My flash
>> version is 4-beta. The simulation ran fine and created 2
>> checkpoint files. I wanted to restart from the second checkpoint file,
>> which has a file size of 15 GB (15424362124 bytes), but the moment
>> the file is read, the simulation just stops.
>>
>> The last lines of the flash log file are the following:
>> [ 01-11-2014 19:24:23.616 ] message: vsize (MB): 202.06
>> (min) 202.12 (max) 202.06 (avg)
>> [ 01-11-2014 19:24:23.619 ] message: rss (MB): 1.67
>> (min) 1.67 (max) 1.67 (avg)
>> [ 01-11-2014 19:24:23.628 ] [io_readData] file opened:
>> type=checkpoint name=xSHx_hdf5_chk_0002
>>
>> I get a lot of core file dumps, one per core (see attachment), and
>> the following error output on the terminal:
>> NumPartProps: 18
>> NumPartProps: 18
>> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0]
>> :749776:ibm.runjob.client.Job: terminated by signal 5
>> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0]
>> :749776:ibm.runjob.client.Job: abnormal termination by signal 5 from
>> rank 208
>>
>> The supercomputer is the Italian supercomputer CINECA and has
>> 16-core nodes with 16 GB of RAM per node. My BGsize is 64 nodes with 16
>> ranks per node, meaning that I have 1 GB per core and 1024 cores in
>> total with 1 TB of RAM.
>>
>> The code is compiled with XLF compilers, i.e., mpixlf90 (not the
>> mpixlf90_r, of which I do not know the use) and mpixlc(xx). There
>> were several ".f" files that would not compile, so I solved it by
>> compiling them separately with mpixlf77. These files were
>> "fftsg2d.f", "fftsg3d.f", and "umap.F". I don't know how critical
>> that is, but the compilation is successful.
>>
>> Compiling the code in debug mode, i.e., with "-g -qfullpath -O0
>> -qcheck" instead of the normal "-O3 -qintsize=4 -qrealsize=8 -c
>> -qxlf90=autodealloc -qsuffix=cpp=F -qtune=auto -qstrict -qarch=auto
>> -qextname -qzerosize", produced the following warnings and errors:
>> "io_writeData.F90", line 242.10: 1511-013 (W) The value of the
>> DO-loop increment should be negative when initial value is greater
>> than the terminal value.
>> "io_writeData.F90", line 308.24: 1511-013 (W) The value of the
>> DO-loop increment should be negative when initial value is greater
>> than the terminal value.
>> "io_writeData.F90", line 316.21: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 469.22: 1511-013 (W) The value of the
>> DO-loop increment should be negative when initial value is greater
>> than the terminal value.
>> "io_writeData.F90", line 480.29: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 481.29: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 482.29: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 483.29: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 491.34: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 493.37: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 494.37: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 506.29: 1516-152 (S) Zero-sized arrays must
>> not be subscripted.
>> "io_writeData.F90", line 578.24: 1511-013 (W) The value of the
>> DO-loop increment should be negative when initial value is greater
>> than the terminal value.
>> "io_writeData.F90", line 744.10: 1511-013 (W) The value of the
>> DO-loop increment should be negative when initial value is greater
>> than the terminal value.
>> ** io_writedata === End of Compilation 1 ===
>> 1501-511 Compilation failed for file io_writeData.F90.
>> make: *** [io_writeData.o] Error 1
>>
>> The same errors also appear in "IO_init.F90":
>> "IO_init.F90", line 157.13: 1511-013 (W) The value of the DO-loop
>> increment should be negative when initial value is greater than the
>> terminal value.
>> "IO_init.F90", line 182.13: 1511-013 (W) The value of the DO-loop
>> increment should be negative when initial value is greater than the
>> terminal value.
>> "IO_init.F90", line 249.11: 1511-013 (W) The value of the DO-loop
>> increment should be negative when initial value is greater than the
>> terminal value.
>> "IO_init.F90", line 250.13: 1516-152 (S) Zero-sized arrays must not
>> be subscripted.
>> "IO_init.F90", line 251.13: 1516-152 (S) Zero-sized arrays must not
>> be subscripted.
>> "IO_init.F90", line 252.13: 1516-152 (S) Zero-sized arrays must not
>> be subscripted.
>> "IO_init.F90", line 278.8: 1511-013 (W) The value of the DO-loop
>> increment should be negative when initial value is greater than the
>> terminal value.
>> "IO_init.F90", line 279.37: 1516-152 (S) Zero-sized arrays must not
>> be subscripted.
>> ** io_init === End of Compilation 1 ===
>> 1501-511 Compilation failed for file IO_init.F90.
>>
>> There were no errors without the -qcheck option, so I continued
>> without it. The same crash happens with these compiler
>> options, without any extra log information. These error/warning lines
>> might be related to SCRATCH_GRID_VARS_* having end=0 and begin=1 values.
>> I do not make use of the scratch array, so I assumed this is not
>> critical.
>>
>> Any help is appreciated.
>> Best,
>> Seyit
>