[FLASH-USERS] Restarting a Flash simulation on BlueGene

Seyit Hocuk seyit at astro.rug.nl
Wed Jan 15 09:12:17 EST 2014


Hi Sean,

Thanks for your response.
No, I do not use parallel IO. So would simply adding this flag 
help? Are any additional libraries needed? I will surely try it in that case.

Expanded Command line:
--with-library=mpi --with-unit=IO --unit=Grid 
--gridinterpolation=monotonic SH-dust --auto --portable --3d 
--maxblocks=200 --objdir=ss-dust-3
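
As a rough illustration of trying the two shortcuts from the quoted reply separately, the sketch below builds one candidate setup line per shortcut from the expanded command line above. It assumes the script is invoked as ./setup from the Flash source directory and that setup accepts the expanded options back as input; neither is confirmed in this thread, and the short form originally typed is probably the safer base.

```python
import shlex

# Expanded command line, copied verbatim from this message.
expanded = (
    "--with-library=mpi --with-unit=IO --unit=Grid "
    "--gridinterpolation=monotonic SH-dust --auto --portable --3d "
    "--maxblocks=200 --objdir=ss-dust-3"
)

# Shortcuts suggested in the quoted reply, to be tried one at a time.
shortcuts = ["+parallelIO", "+hdf5typeIO"]


def candidate_setup_line(base, shortcut):
    """Append a single shortcut to the base arguments.

    "./setup" is an assumed invocation, not taken from the thread.
    """
    args = shlex.split(base) + [shortcut]
    return " ".join(["./setup"] + args)


variants = [candidate_setup_line(expanded, s) for s in shortcuts]

for line in variants:
    print(line)
```

Each printed line differs from the original only in the trailing shortcut, so a failed or successful restart can be attributed to that one change.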

I use serial HDF5, by the way. In my experience I do not lose 
much time writing a checkpoint file, so I did not think parallel HDF5 
was necessary. However, I am quite disappointed in the overall 
simulation speed. Could parallel IO also boost the simulation speed? A lot of 
information is passed between the (1024) processors for the regular 
calculations, and I think most of the time is lost there.

Kind regards,
Seyit



On 01/15/2014 02:53 PM, Sean Couch wrote:
> Hi Seyit,
>
> Are you using parallel IO?  What is your setup line?  You might try 
> adding, separately, ‘+parallelIO’ and ‘+hdf5typeIO’ to your setup line 
> and trying again.
>
> Sean
>
> --------------------------------------------------------
> Sean M. Couch
> Hubble Fellow
> Flash Center for Computational Science
> Department of Astronomy & Astrophysics
> The University of Chicago
> 5747 S Ellis Ave, Jo 315
> Chicago, IL  60637
> (773) 702-3899
> www.flash.uchicago.edu/~smc
>
>
>
>
> On Jan 15, 2014, at 4:58 AM, Seyit Hocuk <seyit at astro.rug.nl> wrote:
>
>> Dear all,
>>
>> I have a restarting problem and hope that you can help me.
>>
>> I am running Flash for the first time on a BlueGene supercomputer 
>> with 1024 cores and encountered a problem when restarting. My Flash 
>> version is 4-beta. The simulation ran fine and created 2 
>> checkpoint files. I wanted to restart from the second checkpoint file, 
>> which has a size of 15 GB (15424362124 bytes), but at the moment 
>> the file is read, the simulation just stops.
>>
>> The last lines of the Flash log file are the following:
>>  [ 01-11-2014  19:24:23.616 ] message: vsize (MB): 202.06 
>> (min)        202.12 (max)        202.06 (avg)
>>  [ 01-11-2014  19:24:23.619 ] message: rss   (MB): 1.67 
>> (min)          1.67 (max)          1.67 (avg)
>>  [ 01-11-2014  19:24:23.628 ] [io_readData] file opened: 
>> type=checkpoint name=xSHx_hdf5_chk_0002
>>
>> I get a lot of core dump files, one per core (see attachment), and 
>> the following error output in the terminal:
>> NumPartProps:  18
>> NumPartProps:  18
>> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0] 
>> :749776:ibm.runjob.client.Job: terminated by signal 5
>> 2014-01-11 20:02:48.793 (WARN ) [0x400011e91e0] 
>> :749776:ibm.runjob.client.Job: abnormal termination by signal 5 from 
>> rank 208
>>
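
For reference, the signal number in the runjob output above can be decoded with Python's standard signal module. This is only a sketch: it assumes the compute nodes follow Linux signal numbering, and it says nothing about what raised the signal on rank 208.

```python
import signal

# Signal number reported by ibm.runjob.client.Job in the output above.
signum = 5

# Look up the symbolic name on the machine running this snippet.
# Assumption: its numbering matches the compute nodes (Linux-style).
name = signal.Signals(signum).name

print(f"signal {signum} is {name}")
```

On a Linux workstation this reports the trace/breakpoint trap signal, which points at a trap raised inside the program rather than at the job scheduler.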
>> The machine is hosted at CINECA, the Italian supercomputing center, and 
>> has 16-core nodes with 16 GB RAM per node. My BGsize is 64 nodes with 16 
>> ranks per node, meaning that I have 1 GB per core and 1024 cores in 
>> total with 1 TB RAM.
>>
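
The memory figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this message and compares the checkpoint size with the memory available to a single rank. Whether any rank actually has to hold a large share of the file during a serial-IO restart is an open question here, not something this calculation establishes.

```python
# Figures quoted in the message above.
nodes = 64
ranks_per_node = 16
ram_per_node_gb = 16
checkpoint_bytes = 15424362124

ranks = nodes * ranks_per_node                       # total MPI ranks
ram_per_rank_gb = ram_per_node_gb / ranks_per_node   # memory share per rank
total_ram_gb = nodes * ram_per_node_gb               # aggregate memory

checkpoint_gib = checkpoint_bytes / 2**30            # file size in GiB
avg_bytes_per_rank = checkpoint_bytes / ranks        # even split across ranks

print(f"ranks:               {ranks}")
print(f"RAM per rank:        {ram_per_rank_gb:.1f} GB")
print(f"total RAM:           {total_ram_gb} GB")
print(f"checkpoint size:     {checkpoint_gib:.2f} GiB")
print(f"checkpoint per rank: {avg_bytes_per_rank / 1e6:.1f} MB (if split evenly)")
```

Split evenly, the checkpoint is small compared with the memory of each rank, while the whole file is many times larger than what one rank has available.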
>> The code is compiled with XLF compilers, i.e., mpixlf90 (not the 
>> mpixlf90_r, of which I do not know the use) and mpixlc(xx). There 
>> were several ".f" files that would not compile, so I solved it by 
>> compiling them separately with mpixlf77. These files were 
>> "fftsg2d.f", "fftsg3d.f", and "umap.F". I don't know how critical 
>> that is, but the compilation is successful.
>>
>> Compiling the code in debug mode, i.e., with "-g -qfullpath -O0 
>> -qcheck" instead of the normal "-O3 -qintsize=4 -qrealsize=8 -c 
>> -qxlf90=autodealloc -qsuffix=cpp=F -qtune=auto -qstrict -qarch=auto 
>> -qextname -qzerosize", produced the following warnings and errors:
>> "io_writeData.F90", line 242.10: 1511-013 (W) The value of the 
>> DO-loop increment should be negative when initial value is greater 
>> than the terminal value.
>> "io_writeData.F90", line 308.24: 1511-013 (W) The value of the 
>> DO-loop increment should be negative when initial value is greater 
>> than the terminal value.
>> "io_writeData.F90", line 316.21: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 469.22: 1511-013 (W) The value of the 
>> DO-loop increment should be negative when initial value is greater 
>> than the terminal value.
>> "io_writeData.F90", line 480.29: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 481.29: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 482.29: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 483.29: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 491.34: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 493.37: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 494.37: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 506.29: 1516-152 (S) Zero-sized arrays must 
>> not be subscripted.
>> "io_writeData.F90", line 578.24: 1511-013 (W) The value of the 
>> DO-loop increment should be negative when initial value is greater 
>> than the terminal value.
>> "io_writeData.F90", line 744.10: 1511-013 (W) The value of the 
>> DO-loop increment should be negative when initial value is greater 
>> than the terminal value.
>> ** io_writedata   === End of Compilation 1 ===
>> 1501-511  Compilation failed for file io_writeData.F90.
>> make: *** [io_writeData.o] Error 1
>>
>> The same kinds of messages also appear for "IO_init.F90":
>> "IO_init.F90", line 157.13: 1511-013 (W) The value of the DO-loop 
>> increment should be negative when initial value is greater than the 
>> terminal value.
>> "IO_init.F90", line 182.13: 1511-013 (W) The value of the DO-loop 
>> increment should be negative when initial value is greater than the 
>> terminal value.
>> "IO_init.F90", line 249.11: 1511-013 (W) The value of the DO-loop 
>> increment should be negative when initial value is greater than the 
>> terminal value.
>> "IO_init.F90", line 250.13: 1516-152 (S) Zero-sized arrays must not 
>> be subscripted.
>> "IO_init.F90", line 251.13: 1516-152 (S) Zero-sized arrays must not 
>> be subscripted.
>> "IO_init.F90", line 252.13: 1516-152 (S) Zero-sized arrays must not 
>> be subscripted.
>> "IO_init.F90", line 278.8: 1511-013 (W) The value of the DO-loop 
>> increment should be negative when initial value is greater than the 
>> terminal value.
>> "IO_init.F90", line 279.37: 1516-152 (S) Zero-sized arrays must not 
>> be subscripted.
>> ** io_init   === End of Compilation 1 ===
>> 1501-511  Compilation failed for file IO_init.F90.
>>
>> There were no errors without the -qcheck option, so I continued 
>> without -qcheck. The same crash happens with these compiler 
>> options, without any extra log information. These error/warning lines 
>> might be related to SCRATCH_GRID_VARS_* having end=0 and begin=1 values. 
>> I do not make use of the scratch array, so I assumed this is not 
>> critical.
>>
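
To illustrate why a loop with begin=1 and end=0 is usually harmless at run time, the sketch below reproduces the Fortran rule for the iteration count of a DO loop, MAX(INT((end - begin + step) / step), 0), in Python. The function name is made up for this example, and the link to SCRATCH_GRID_VARS_* is the guess from the message above, not something verified against the Flash source.

```python
def fortran_trip_count(start, end, step=1):
    """Number of iterations of a Fortran 'DO i = start, end, step' loop.

    Implements MAX(INT((end - start + step) / step), 0), where INT
    truncates toward zero.
    """
    if step == 0:
        raise ValueError("DO-loop increment must be nonzero")
    n = end - start + step
    q = abs(n) // abs(step)          # magnitude of the truncated quotient
    if (n < 0) != (step < 0):        # operands have opposite signs
        q = -q
    return max(q, 0)


# begin=1, end=0 with the default increment: the loop body never runs.
scratch_trips = fortran_trip_count(1, 0)

# An ordinary loop and a descending loop, for comparison.
normal_trips = fortran_trip_count(1, 3)
descending_trips = fortran_trip_count(3, 1, -1)

print(scratch_trips, normal_trips, descending_trips)
```

A zero-trip loop explains the (W) warnings. The (S) messages are a separate matter: they come from the compiler rejecting subscripts on arrays it can see are zero-sized when -qcheck is on.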
>> Any help is appreciated.
>> Best,
>> Seyit
>
