[FLASH-USERS] [EXT] Code crash when moving to an AMD based machine

Haakon Andresen haakon.andresen at astro.su.se
Mon May 6 10:50:24 EDT 2024


To follow up:

I found and solved the issue. The root cause was a mismatch between real types (real*8 versus real*4).

The makefile I was using did not apply the correct default-real flags. Variables, in particular pdx and/or pdy, are passed from
amr_prolong_gen_unk1_fun to routines in umap.F. The values were correct in amr_prolong_gen_unk1_fun, but they
were zero on arrival in the routines in umap.F.
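
One way such a zero can arise is sketched below. This is an illustration under stated assumptions, not a confirmed reconstruction of what happened inside FLASH: it assumes a little-endian machine (x86-64, which includes AMD EPYC), an argument passed by reference, a caller that wrote an 8-byte real, and a callee that reads only the first 4 bytes as a 4-byte real. The function name double_seen_as_real4 and the sample values are hypothetical.

```python
import struct


def double_seen_as_real4(value: float) -> float:
    """Pack `value` as a little-endian 8-byte real and reinterpret its
    first 4 bytes as a 4-byte real, mimicking a callee compiled with a
    smaller default real size than its caller (an assumption)."""
    raw = struct.pack("<d", value)          # what the caller stored
    return struct.unpack("<f", raw[:4])[0]  # what the callee would read


# Hypothetical sample values: the low 4 bytes of these doubles are all
# zero bits, so the callee sees exactly zero.
for pdx in (0.5, 0.25, 1.0):
    print(pdx, "->", double_seen_as_real4(pdx))

# A value such as 0.1 does not become zero, but it is still garbage.
print(0.1, "->", double_seen_as_real4(0.1))
```

Values whose mantissa fits in the upper half of the 8-byte representation arrive as exactly zero in this sketch; other values arrive as unrelated small numbers.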

Running the Sedov test reproduced the issue and made debugging much simpler, which is how I found the cause.

I am mostly replying here in case this comes up in the future.

In short: a mismatch between real*8 and real*4 can cause bugs in the guard-cell fill. Make sure the right flags
are set for both F90 and F77.
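
A quick consistency check along these lines can be sketched as follows. The variable names F90FLAGS and F77FLAGS and the helper names are hypothetical; adapt them to whatever your makefile actually defines. The promotion flags listed are the ones I am fairly confident of for GNU, Intel and Cray Fortran, but check your compiler manual before relying on the list.

```python
# Flags that promote default REAL to 8 bytes (assumed list; verify
# against your compiler's documentation).
KNOWN_FLAGS = (
    "-fdefault-real-8",  # GNU gfortran
    "-r8",               # Intel
    "-real-size 64",     # Intel, long form
    "-s real64",         # Cray ftn
)


def promotes_real8(flags: str) -> bool:
    """True if the flag string contains a known real*8 promotion flag."""
    normalized = " " + " ".join(flags.split()) + " "
    return any(f" {flag} " in normalized for flag in KNOWN_FLAGS)


def inconsistent(flag_sets: dict) -> list:
    """Names of flag variables that lack promotion while others have it."""
    status = {name: promotes_real8(flags) for name, flags in flag_sets.items()}
    if len(set(status.values())) <= 1:
        return []  # all promote, or none do: consistent
    return sorted(name for name, ok in status.items() if not ok)


# Hypothetical makefile contents for illustration only.
example = {
    "F90FLAGS": "-c -O2 -fdefault-real-8 -fdefault-double-8",
    "F77FLAGS": "-c -O2",
}
print(inconsistent(example))
```

The check only compares flag strings; it does not detect explicit real*4 declarations in the source itself.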


Thanks for your help, Professor Reyes.


Best,
Haakon

________________________________
From: Haakon Andresen
Sent: 30 April 2024 11:28:40
To: Reyes, Adam
Cc: flash-users at flash.rochester.edu
Subject: Re: [EXT] [FLASH-USERS] Code crash when moving to an AMD based machine


It does not happen with a uniform grid. This is consistent with the fact that it does not happen at any
of the blocks where there is no refinement boundary.


It actually seems to happen at the first refinement boundary; there is nothing else special about that location.

It happens right after the initialization is done. The code gets into the hydro solver, where it needs to fill some guard cells,
and during the guard-cell filling it crashes at the first refinement boundary it encounters. The guard cells are filled
appropriately at the other end of the block, i.e. the end not near the refinement boundary.

I fully agree with your assessment: it looks like an uninitialized value or a division by zero. The question is then,
where does this come from?


Best,
Haakon


________________________________
From: Reyes, Adam <adam.reyes at rochester.edu>
Sent: 30 April 2024 11:08:44
To: Haakon Andresen
Cc: flash-users at flash.rochester.edu
Subject: Re: [EXT] [FLASH-USERS] Code crash when moving to an AMD based machine

That density looks like an uninitialized value or a divide by zero, both of which could be handled differently depending on the compiler. If this is, as you say, happening at an interior boundary, it should just be doing prolongation/restriction of interior data from another block.

A couple of questions I have:

  *   Does this happen only if you include a refinement jump, or also with a uniform refinement?
  *   Does it happen during initialization or in the middle of a simulation after the evolution has begun?

*********************************************
Adam Reyes

[FLASH.jpg]
Code Group Leader, Flash Center for Computational Science
Research Scientist, Dept. of Physics and Astronomy
University of Rochester
River Campus: Bausch and Lomb Hall, 369
500 Wilson Blvd. PO Box 270171, Rochester, NY 14627
Email adam.reyes at rochester.edu
Web https://flash.rochester.edu
 (he / him / his)
[FLASH-pride-sml.png]

*********************************************



On Apr 30, 2024, at 10:36 AM, Haakon Andresen <haakon.andresen at astro.su.se> wrote:

Hi,

I can, with one caveat: I am using a modified version of FLASH4. I suspect the error is coming from part of the code that
has not been modified, but I am not 100% certain at this point in time.

I am using the Intel compilers on the machine where it works and Cray (also tested GNU) on the machine where the code crashes.
After talking to a colleague, I have learned that the code compiled with Intel compilers runs on the AMD EPYC 7763 CPUs.
However, the Intel compilers are not available on the machine I am using.

Boundary conditions are "reflect" at the inner boundary and "user" at the outer boundary
(the user condition essentially specifies an outflow/inflow with some modifications for density/gpot/radiation...
/Simulation/SimulationMain/CoreCollapse/Grid_bcApplyToRegionSpecialized.F90 <https://github.com/snaphu-msu/BANG/blob/master/source/Simulation/SimulationMain/CoreCollapse/Grid_bcApplyToRegionSpecialized.F90> if it exists in the standard FLASH version).

I have traced the issue back to negative density values in some guard cells near a refinement boundary, but far away from the outer grid boundary and
not close to the inner grid boundary.

As for error messages, I get
"Error message is [EOS] rho < rhomin"
when using Cray and

"
 Newton-Raphson failed in subroutine eos_helmholtz
 (e and rho as input):

 too many iterations          50

  temp =                        NaN
  dens =   -2.2471185595023190E+307
  pres =                        NaN

"
when using GNU. I am not sure these are informative for you, since they could be specific to our version of
FLASH. The root cause is the negative densities that somehow appear in the guard-cell fill.

Thanks for your help.

Best,
Haakon
________________________________
From: Reyes, Adam <adam.reyes at rochester.edu>
Sent: 30 April 2024 10:04:20
To: Haakon Andresen
Cc: flash-users at flash.rochester.edu
Subject: Re: [FLASH-USERS] Code crash when moving to an AMD based machine

Hi Haakon,

Could you share a bit more context about what you’re observing, maybe the exact error from FLASH and the boundary conditions that you’re using? Are you using the same compiler between the two machines?


On Apr 29, 2024, at 4:02 PM, Haakon Andresen <haakon.andresen at astro.su.se> wrote:

Dear Flash users,

I am currently testing FLASH on a machine with AMD CPUs, specifically AMD EPYC 7763. Previously, I have only used the code on Intel-based architectures. I have encountered a bug, which I believe is related to guard-cell fills, but I have not found the root cause yet.

I am doing core-collapse simulations with paramesh. The code initializes and writes the first checkpoint file, but then crashes in the hydro solver. The crash occurs during a call to the subroutine that is responsible for filling guard cells. My debugging has led me all the way into the paramesh routines. The error, guard cells being filled with bad values, happens at a refinement boundary. The test is done in 1D.

The puzzling part is that the code, with the exact same setup, runs just fine on a different machine (an Intel machine). I am wondering if anyone has seen similar behavior in the past. I am not sure that the root cause is in paramesh, but if anyone has experienced anything similar I would love to hear about it; maybe it will help me identify the issue.

Best,
Haakon Andresen
_______________________________________________
flash-users mailing list
flash-users at flash.rochester.edu

For list info, including unsubscribe:
https://flash.rochester.edu/mailman/listinfo/flash-users
