[FLASH-USERS] Open MPI error during refinement
Ryan Farber
rjfarber at umich.edu
Tue Oct 8 19:02:51 EDT 2019
Hi Claude,
Thanks for attaching the log file. I'm pretty sure you need >= 105
processors (whereas you used 103 in the log file you sent me), which may
explain why it's failing.
The end of the log file shows PARAMESH is trying to make 10,265 blocks. With
maxblocks=100, your 103 cores theoretically give you just enough block slots
(103 x 100 = 10,300), but in practice I've found FLASH needs about 2% more
block slots available than the maximum value "requested." That's where the
105 comes from.
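In case it helps, here is a quick back-of-the-envelope sketch of that
arithmetic (the 10,265 blocks and maxblocks=100 come from your log; the 2%
headroom is just my rule of thumb, not anything FLASH itself enforces):

  import math

  total_blocks = 10265   # peak block count PARAMESH tries to create (from the log)
  maxblocks    = 100     # blocks allowed per processor (maxblocks)
  headroom     = 1.02    # ~2% safety margin from my own runs

  bare_minimum = math.ceil(total_blocks / maxblocks)             # 103
  recommended  = math.ceil(total_blocks * headroom / maxblocks)  # 105
  print(bare_minimum, recommended)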
The magic 2% comes from lots of runs on Comet, Stampede2, and local
clusters. That was all with FLASH4.2.2, though, so the magic number may be a
bit different for FLASH4.5.
Best,
--------
Ryan
On Tue, Oct 8, 2019 at 2:03 PM Claude Cournoyer-Cloutier <
cournoyc at mcmaster.ca> wrote:
> Hi Ryan,
>
> Thank you for your quick answer. Here is a copy of the log file for that
> simulation. I might try to increase the available memory, but we are
> already fairly far down the queue on our cluster with our current
> requests.
>
> Best,
>
> Claude
>
>
>
> On Oct 8, 2019, at 4:41 PM, Ryan Farber <rjfarber at umich.edu> wrote:
>
> Hi Claude,
>
> It may help to see your log file (or at least the last few lines of it).
> Typically when I have a run crash during refinement with no apparent
> cause, I increase available memory (increase nodes, reduce cores per node)
> and that fixes it.
>
> Best,
> --------
> Ryan
>
>
> On Tue, Oct 8, 2019 at 8:51 AM Claude Cournoyer-Cloutier <
> cournoyc at mcmaster.ca> wrote:
>
>> Dear FLASH users,
>>
>> I am using FLASH, coupled with the astrophysics code AMUSE, to model star
>> formation in clusters and cluster formation. I am able to run a test
>> version of my simulation, but I encounter issues with refinement when I
>> try to run a more involved version with the same parameters but a higher
>> maximum refinement level.
>>
>> During refinement, I get the following output (truncated) and error
>> message:
>>
>> refined: total blocks = 585
>> iteration, no. not moved = 0 323
>> iteration, no. not moved = 1 27
>> iteration, no. not moved = 2 0
>> refined: total leaf blocks = 2311
>> refined: total blocks = 2641
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 45 in communicator MPI COMMUNICATOR 4 SPLIT
>> FROM 0
>> with errorcode 4.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>>
>> From a few discussions I've had, I think the issue might have something
>> to do with how the blocks are distributed across processors, but I
>> cannot find anything useful online. Have any of you encountered this
>> error, or a similar one, in the past?
>>
>> Best regards,
>>
>> Claude
>>
>> —
>>
>> Claude Cournoyer-Cloutier
>> Master’s Candidate, McMaster University
>> Department of Physics & Astronomy
>>
>>
>>
>>
> Claude Cournoyer-Cloutier
> Master’s Candidate, McMaster University
> Department of Physics & Astronomy
>
>
>
>