Subsections

8.10 GridSolvers

**Figure 8.11:** The `Grid` unit: structure of `GridSolvers` subunit.

The GridSolvers unit groups together subunits that are used to solve particular types of differential equations. Currently, there are two types of solvers: a parallel Fast Fourier Transform package (Sec:GridSolversPfft) and various solvers for the Poisson equation (Sec:GridSolversPoisson).

8.10.1 Pfft

Pfft is a parallel frame work for computing a Fast Fourier Transform (FFT) on uniform grids. It can also be combined with the Multigrid solver described below in Sec:GridSolversMultigrid to let the composite solver scale to thousands of processors.

Pfft has a layered architecture where the lower layer contains functions implementing basic tasks and primary data structures. The upper layer combines pieces from the lowest layer to interface with FLASH and create the parallel transforms. The computational part of Pfft is handled by sequential 1-dimensional FFT's, which can be from a native, vendor supplied scientific library, or from public domain packages. The current distribution of FLASH uses fftpack from NCAR for the 1-D FFTs, since that package also contains transforms that are useful with non-periodic boundary conditions.

The lowest layer has three distinct components. The first component redistributes data. It includes routines for distributed transposes for two and three dimensional data. The second component provides a uniform interface for FFT calls to hide the details of individual libraries. The third component is the data structures. There are global data structures to keep track of the type of transform, number of data dimensions, and physical and transform space information about each dimension. The dimensional information includes the start and end point of data (useful when the dimension is spread over more than one processor), the MPI communicator, the coordinates of the node in the processor grid etc. The structures also include pointers to the trigonometric tables and work space needed by the sequential FFT calls.

The upper layer of PFFT combines the lower layer routines to create end-to-end transforms in a variety of ways. The available one dimensional transforms include real-to-complex, complex-to-complex, sine and cosine transforms. They can be combined to create two or three dimensional tranforms for different configuration of the domain. The two dimensional transforms support parallelization of one dimension (or a one dimensional grid of processors). The three dimensional transforms support one or two dimensional grid of processors. All transforms must have at least one dimension within the processor at all times. The data distribution changes during the computation. However, a complete cycle of forward and inverse transform restores the data distribution.

The computation of a forward three dimensional FFT in parallel involves following steps :

Compute 1-D transforms along x.
Reorder, or transpose from x-y-z to y-z-x
Compute 1-D transforms along y. If the transform along x was real-to-complex, this one must be a complex-to-complex transform.
Transpose from y-z-x to z-x-y
Compute 1-D FFTs along z. If the transform along x or y was real-to-complex, this must be a complex-to-complex transform.

The inverse transform can be computed by carrying out the steps described above in reverse order. The more commonly used domain decomposition in FFT based codes assumes a one dimensional processor grid:

$\displaystyle N_1 \times N_2 \times N_3/P,$

(8.5)

where $N_1 \times N_2 \times N_3$ is the global data size and

is the number of processors. Here, the first transpose is local, while the second one is distributed. The internode communication is limited to one distributed transpose involving all the processors. However, there are two distinct disadvantages of this distribution of work:

The size of the problem imposes an upper limit on the number of processors, in that the largest individual dimension is also the largest number of active processors. A three dimensional problem is forced to have modest individual dimensions to fit in the processor memory.
As the machine size grows, the internode exchanges become long range, and the possibility of contention grows.

We have chosen a domain decomposition where each subdomain is a column of size

$\displaystyle N_1 \times N_2/P_1 \times N_3/P_2$			(8.6)
$\displaystyle P = P_1 \times P_2.$

With this distribution both the transposes are distributed in parallel. The data exchange along any one processor grid dimension is a collection of disjointed distributed transposes. Here, the contention and communication range is reduced, while the volume of data exchange is unaltered. The distributed transposes are implemented using collective MPI operation alltoall. In a slabwise distribution, the upper limit on the number of processors is determined by the smallest of

, where as in our distribution, the upper limit on the number of processors is the smallest of

8.10.1.1 Using `Pfft`

Pfft can only be used with a pencil grid, with the constraint that the number of processors along the IAXIS must be 1. This is because all one dimensional transforms are computed locally within a processor. However, FLASH contains a set of data movement subroutines that generate a usable pencil grid from any UG grid or any level of a PM grid. These routines are briefly explained in Section 8.10.1.2.

During the course of a simulation, Pfft can be used in two different modes. In the first mode, every instance of Pfft use will be exactly identical to every other instance in terms of domain size and the type of transforms. In this mode, the user can set the runtime parameter pfft_setupOnce to true, which enables the FLASH initialization process to also create and initialize all the data structures of Pfft. The finalization of the Pfft subunit is also done automatically by the FLASH shutdown process in this mode. However, if a simulation needs to use Pfft in different configurations at different instances of its use, then each set of calls to Grid_pfft for computing the transforms must be preceded by a call to Grid_pfftInit and followed by a call to Grid_pfftFinalize. In addition, the runtime parameter pfft_setupOnce should be set to false. A few other helper routines are available in the subunit to allow the user to query Pfft about the dimensioning of the domain, and also to map the Mesh variables from the unk array to and from Pfft compatible (single dimensional) arrays. Pfft also provides the location of wave numbers in the parallel domain; information that users can utilize to develop their own customized PDE solvers using FFT based techniques.

8.10.1.2 `Pfft` data movement subroutines

Mesh reconfiguration subroutines are available to generate a pencil grid for the Pfft unit from another mesh configuration. Two different implementations are available at Grid/GridSolvers/Pfft/MeshReconfiguration/PtToPt and Grid/GridSolvers/Pfft/MeshReconfiguration/Collective, with the PtToPt implementation being the default. Both implementations are able to generate an appropriate pencil grid in UG and PM mode. The pencil processor grid is automatically selected, but can be overridden by passing optional arguments to Grid_pfftInit. In UG mode they are invoked when the number of processors in the IAXIS of the FLASH grid is greater than one, and in PM mode they are always invoked. In PM mode they generate a pencil grid from a single level of the AMR grid, which may be manually specified or automatically selected as the maximum level that is fully-refined (i.e. has blocks that completely cover the computational domain at this level).

The pencil grid processor topology is stored in an MPI communicator, and the communicator may contain fewer processors than are used in the simulation. This is to ensure the pencil grid points are never distributed too finely over the processors, and naturally handles the case where the user may wish to obtain a pencil grid at a very coarse level in the AMR grid. If there are more blocks than processors then we are safe to distribute the pencil grid over all processors, otherwise we must remove a number of processors. Currently, we eliminate those processors that own zero FLASH blocks at this level, as this is a simple calculation that can be computed locally.

Both mesh reconfiguration implementations generate a map describing the data movement before moving any grid data. The map is retained between calls to the Pfft routines and is only regenerated when the grid changes. This avoids repeating the same global communications, but means communication buffers are left allocated between calls to Pfft.

In the Collective implementation, the map coordinates are used to specify where the FLASH data is copied into a send communication buffer. Two MPI_Alltoall calls then move this data to the appropriate pencil processor at coordinates (J,K). Here, the first MPI_Alltoall moves data to processor (J,0), and the second MPI_Alltoall moves data to processor (J,K). The decision to use MPI_Alltoalls simplifies the MPI communication, but leads to very large send / receive communication buffers on each processor which consume:

Memory(bytes) = sizeof(REAL) * total grid points at solve level * 2

The PtToPt implementation consumes less memory compared to the Collective implementation, and communicates using point to point MPI messages. It is based upon using nodes in a linked list which contain metadata (a map) and a communication buffer for a single block fragment. There are two linked lists: one for the FLASH block fragments and one for Pfft block fragments. Metadata information about each FLASH block fragment is placed in separate messages and sent using MPI_Isend to the appropriate destination pencil grid processor.

Each destination pencil grid processor repeatedly invokes MPI_Iprobe using MPI_ANY_SOURCE, and creates a node in its Pfft list whenever it discovers a message. The MPI message is received into a metadata region of the freshly allocated node, and a communication buffer is also allocated according to the size specified in the metadata. The pencil processor continues probing for messages until the cumulative size of its node's communication buffers is equal to the pencil grid points it has been assigned. At this stage, grid data is communicated by traversing the Pfft list and posting MPI_Irecvs, and then traversing the FLASH list and sending block fragment using MPI_Isends. After performing MPI_Waits, the received data in the nodes of the Pfft list is copied into internal Pfft arrays.

Note, the linked list is constructed using an include file stored at flashUtilities/datastructures/linkedlist. The file is named ut_listMethods.includeF90 and is meant to be included in any Fortran90 module to create lists with nodes of a user-defined type. Please see the README file, and the unit test example at flashUtilities/datastructures/linkedlist/UnitTest.

8.10.1.3 Unit Test

The unit test for Pfft solver solves the following equation:

$\displaystyle \nabla^2({\bf F})=-13.0*\cos2x*\sin3y$

(8.7)

The simplest analytical solution of this equation assuming no constants is

$\displaystyle F=\cos2x*\sin3y$

(8.8)

We discretize the domain by assuming

, and $xmax,ymax,zmax=2\pi$ . The equation satisfies periodic boundary conditions in this formulation and FFT based poisson solve techniques can be applied. In the unit test we initialize one variable of the solution data with the function

, and another one with the right hand side of (8.7). We compute the forward real-to-complex transform of the solution data variable that is initialized with the right hand side of (8.7). This variable is then divided by $({k_i}^2+{k_j}^2+{k_k}^2)$ where ${k_i, k_j}$ and ${k_k}$ are the wavenumbers at any point i,j,k in the domain. An inverse complex-to-real transform after the division should give the function

as result. Hence the unit test is considered successful if both the variables have matching values within the specified tolerance.

8.10.2 Poisson equation

The GridSolvers subunit contains several different algorithms for solving the general Poisson equation for a potential $\phi({\bf x})$ given a source $\rho({\bf x})$

$\displaystyle \nabla^2\phi({\bf x}) = \alpha\rho({\bf x}) .$

(8.9)

Here $\alpha$ is a constant that depends upon the application. For example, when the gravitational Poisson equation is being solved, $\rho({\bf x})$ is the mass density, $\phi({\bf x})$ is the gravitational potential, and $\alpha = 4\pi G$ , where

is Newton's gravitational constant.

8.10.2.1 Multipole Poisson solver (original version)

This section describes the multipole Poisson solver that has been included in all the past releases of FLASH. It is included in the current release also, however, certain limitations found in this solver lead to a new implementation described in Sec:GridSolversMultipoleImproved. This version is retained in FLASH4, because the new version is missing the ability to treat a non-zero minimal radius for spherical geometries and the ability to specify a point mass contribution to the potential. This will be implemented for the next coming release.

The multipole Poisson solver is appropriate for spherical or nearly-spherical source distributions with isolated boundary conditions (Müller (1995)). It currently works in 1D and 2D spherical, 2D axisymmetric cylindrical (), and 3D Cartesian and axisymmetric geometries. Because of the imposed symmetries, in the 1D spherical case, only the monopole term ( $\ell = 0$ ) makes sense, while in the axisymmetric and 2D spherical cases, only the moments are used (i.e., the basis functions are Legendre polynomials).

The multipole algorithm consists of the following steps. First, find the center of mass ${\bf x}_{\rm cm}$

$\displaystyle {\bf x}_{\rm cm} = {\int d^3{\bf x} {\bf x}\rho({\bf x}) \over \int d^3{\bf x} \rho({\bf x})} .$

(8.10)

We will take ${\bf x}_{\rm cm}$ as our origin. In integral form, Poisson's ((8.9)) is

$\displaystyle \phi({\bf x}) = -{\alpha\over 4\pi}\int d^3{\bf x}' {\rho({\bf x}')\over \vert{\bf x} - {\bf x}'\vert} .$

(8.11)

The Green's function for this equation satisfies the relationship

$\displaystyle {1\over \vert{\bf x} - {\bf x}'\vert} = 4\pi\sum_{\ell=0}^\infty ... ...over r_>^{\ell+1}} Y_{\ell m}^*(\theta',\varphi') Y_{\ell m}(\theta,\varphi) ,$

(8.12)

where the components of ${\bf x}$ and ${\bf x}'$ are expressed in spherical coordinates $(r,\theta,\varphi)$ about ${\bf x}_{\rm cm}$ , and

$\displaystyle r_<$	$\displaystyle \equiv$	$\displaystyle \min\{ \vert{\bf x}\vert, \vert{\bf x}'\vert \}$	(8.13)
$\displaystyle r_>$	$\displaystyle \equiv$	$\displaystyle \max\{ \vert{\bf x}\vert, \vert{\bf x}'\vert \} .$

Here $Y_{\ell m}(\theta,\varphi)$ are the spherical harmonic functions

$\displaystyle Y_{\ell m}(\theta,\varphi) \equiv (-1)^m \sqrt{ {2\ell+1\over 4\pi} {(\ell-m)!\over (\ell+m)!} } P_{\ell m}(\cos\theta) e^{im\varphi} .$

(8.14)

$P_{\ell m}(x)$ are Legendre polynomials. Substituting (8.12) into (8.11), we obtain

$\displaystyle \phi({\bf x})$	$\displaystyle =$	$\displaystyle -\alpha \sum_{\ell=0}^\infty \sum_{m=-\ell}^\ell {1\over 2\ell+1} \Biggl\{ Y_{\ell m}(\theta,\varphi) \times$	(8.15)
		$\displaystyle \left[ r^\ell \int_{r<r'} d^3{\bf x}' {\rho({\bf x}') Y_{\ell m}^... ...x}' \rho({\bf x}') Y_{\ell m}^*(\theta',\varphi') {r'}^\ell \right]\Biggr\} .$

In practice, we carry out the first summation up to some limiting multipole $\ell_{\rm max}$ . By taking spherical harmonic expansions about the center of mass, we ensure that the expansions are dominated by low-multipole terms, so that for a given value of $\ell_{\rm max}$ , the error created by neglecting high-multipole terms is minimized. Note that the product of spherical harmonics in (8.15) is real-valued

$\displaystyle \sum_{m=-\ell}^\ell Y_{\ell m}^*(\theta',\varphi') Y_{\ell m}(\theta,\varphi)$	$\displaystyle =$	$\displaystyle {2\ell+1\over 4\pi} \Biggl[ P_{\ell 0}(\cos\theta) P_{\ell 0}(\cos\theta') +$	(8.16)
		$\displaystyle 2\sum_{m=1}^\ell {(\ell-m)!\over (\ell+m)!} P_{\ell m}(\cos\theta)P_{\ell m}(\cos\theta') \cos\left(m(\varphi-\varphi')\right) \Biggr] .$

Using a trigonometric identity to split up the last cosine in this expression and substituting for the inner sums in (8.15), we obtain

$\displaystyle \phi({\bf x})$	$\displaystyle =$	$\displaystyle -{\alpha\over 4\pi} \sum_{\ell=0}^\infty P_{\ell 0}(\cos\theta)\l... ...u^{\rm eo}_{\ell 0}(r) + {1\over r^{\ell+1}} \mu^{\rm ei}_{\ell 0}(r) \right] -$
		$\displaystyle {\alpha\over 2\pi} \sum_{\ell=1}^\infty \sum_{m=1}^\ell P_{\ell m... ...i) \mu^{\rm eo}_{\ell m}(r) + (r^\ell \sin m\varphi) \mu^{\rm oo}_{\ell m}(r) +$	(8.17)
		$\displaystyle \qquad\qquad\qquad\qquad\qquad {\cos m\varphi\over r^{\ell+1}} \m... ...ell m}(r) + {\sin m\varphi\over r^{\ell+1}} \mu^{\rm oi}_{\ell m}(r) \biggr] .$

The even (e)/odd (o), inner (i)/outer (o) source moments in this expression are defined to be

$\displaystyle \mu^{\rm ei}_{\ell m}(r)$	$\displaystyle \equiv$	$\displaystyle {(\ell-m)!\over (\ell+m)!} \int_{r>r'} d^3{\bf x}' {r'}^\ell \rho({\bf x}') P_{\ell m}(\cos\theta') \cos m\varphi'$	(8.18)
$\displaystyle \mu^{\rm oi}_{\ell m}(r)$	$\displaystyle \equiv$	$\displaystyle {(\ell-m)!\over (\ell+m)!} \int_{r>r'} d^3{\bf x}' {r'}^\ell \rho({\bf x}') P_{\ell m}(\cos\theta') \sin m\varphi'$	(8.19)
$\displaystyle \mu^{\rm eo}_{\ell m}(r)$	$\displaystyle \equiv$	$\displaystyle {(\ell-m)!\over (\ell+m)!} \int_{r<r'} d^3{\bf x}' {\rho({\bf x}')\over {r'}^{\ell+1}} P_{\ell m}(\cos\theta') \cos m\varphi'$	(8.20)
$\displaystyle \mu^{\rm oo}_{\ell m}(r)$	$\displaystyle \equiv$	$\displaystyle {(\ell-m)!\over (\ell+m)!} \int_{r<r'} d^3{\bf x}' {\rho({\bf x}')\over {r'}^{\ell+1}} P_{\ell m}(\cos\theta') \sin m\varphi' .$	(8.21)

The procedure is thus to compute the moment integrals ((8.18)

(8.21)) for a given source field $\rho({\bf x})$ , and then to use these moments in (8.17) to compute the potential.

In practice, the above procedure must take account of the fact that the source and the potential are assumed to be cell-averaged quantities discretized on a block-structured mesh with varying cell size. Also, because of the radial dependence of the multipole moments of the source function, these moments must be tabulated as functions of distance from ${\bf x}_{\rm cm}$ , with an implied discretization. The solver allocates storage for moment samples spaced a distance $\Delta$ apart in radius

$\displaystyle \mu^{\rm ei}_{\ell m,q}\equiv \mu^{\rm ei}_{\ell m}(q\Delta)$		$\displaystyle \mu^{\rm eo}_{\ell m,q}\equiv \mu^{\rm eo}_{\ell m}((q-1)\Delta)$	(8.22)
$\displaystyle \mu^{\rm oi}_{\ell m,q}\equiv \mu^{\rm oi}_{\ell m}(q\Delta)$		$\displaystyle \mu^{\rm oo}_{\ell m,q}\equiv \mu^{\rm oo}_{\ell m}((q-1)\Delta) .$	(8.23)

The sample index

varies from 0 to

( $\mu^{\rm eo}_{\ell m,0}$ and $\mu^{\rm oo}_{\ell m,0}$ are not used). The sample spacing $\Delta$ is chosen to be one-half the geometric mean of the

, and

cell spacings at the highest level of refinement, and

is chosen to be large enough to span the diagonal of the computational volume with samples.

Determining the contribution of individual cells to the tabulated moments requires some care. To reduce the error caused by the grid geometry, in each cell an optional subgrid can be establish (see mpole_subSample) consisting of points at the locations ${\bf x}'_{i'j'k'}$ , where

$\displaystyle x'_{i'} = x_i + (i'-0.5(N'-1)){\Delta x_i\over {N'}} ,$	$\displaystyle i' = 0\ldots {N'}-1$	(8.24)
$\displaystyle y'_{j'} = y_j + (j'-0.5(N'-1)){\Delta y_j\over {N'}} ,$	$\displaystyle j' = 0\ldots {N'}-1$	(8.25)
$\displaystyle z'_{k'} = z_k + (k'-0.5(N'-1)){\Delta z_k\over {N'}} ,$	$\displaystyle k' = 0\ldots {N'}-1 ,$	(8.26)

and where ${\bf x}_{ijk}$ is the center of cell

. (For clarity, we have omitted

indices on ${\bf x}'$ as well as all block indices.) For each subcell, we assume $\rho({\bf x}'_{i'j'k'})\approx\rho_{ijk}$ and then apply

$\displaystyle \mu^{\rm ei}_{\ell m,q\ge q'}$	$\displaystyle \leftarrow$	$\displaystyle \mu^{\rm ei}_{\ell m,q\ge q'} + {(\ell-m)!\over (\ell+m)!} {\Delt... ...rho({\bf x}'_{i'j'k'}) P_{\ell m}(\cos\theta'_{i'j'k'}) \cos m\varphi'_{i'j'k'}$	(8.27)
$\displaystyle \mu^{\rm oi}_{\ell m,q\ge q'}$	$\displaystyle \leftarrow$	$\displaystyle \mu^{\rm oi}_{\ell m,q\ge q'} + {(\ell-m)!\over (\ell+m)!} {\Delt... ...rho({\bf x}'_{i'j'k'}) P_{\ell m}(\cos\theta'_{i'j'k'}) \sin m\varphi'_{i'j'k'}$	(8.28)
$\displaystyle \mu^{\rm eo}_{\ell m,q\le q'}$	$\displaystyle \leftarrow$	$\displaystyle \mu^{\rm eo}_{\ell m,q\le q'} + {(\ell-m)!\over (\ell+m)!} {\Delt... ...r'}_{i'j'k'}^{\ell+1}} P_{\ell m}(\cos\theta'_{i'j'k'}) \cos m\varphi'_{i'j'k'}$	(8.29)
$\displaystyle \mu^{\rm oo}_{\ell m,q\le q'}$	$\displaystyle \leftarrow$	$\displaystyle \mu^{\rm oo}_{\ell m,q\le q'} + {(\ell-m)!\over (\ell+m)!} {\Delt... ...{i'j'k'}^{\ell+1}} P_{\ell m}(\cos\theta'_{i'j'k'}) \sin m\varphi'_{i'j'k'} ,$	(8.30)

where

$\displaystyle q' = \left\lfloor{\vert{\bf x}'_{i'j'k'} }\vert \over {\Delta} \right\rfloor + 1$

(8.31)

is the index of the radial sample within which the subcell center lies. These expressions introduce (hopefully) small errors when compared to ((8.18))

(8.21), because the subgrid volume elements are not spherical. These errors are greatest when $r' \sim \Delta x$ ; hence, using a subgrid reduces the amount of source affected by these errors. An error of order $\Delta^2$ is also introduced by assuming the source profile within each cell to be flat. Note that the total source computed by this method ( $\mu^{\rm ei}_{\ell m,N_q}$ ) is exactly equal to the total implied by $\rho_{ijk}$ .

Another way to reduce grid geometry errors when using the multipole solver is to modify the AMR refinement criterion to refine all blocks containing the center of mass (in addition to other criteria that may be used, such as the second-derivative criterion supplied with PARAMESH). This ensures that the center-of-mass point is maximally refined at all times, further restricting the volume which contributes errors to the moments because $r' \sim \Delta x$ .

The default value of is 1; note that large values of this parameter very quickly increase the amount of time required to evaluate the multipole moments (as ${N'}^3$ ). In order to speed up the moment summations, the sines and cosines in ((8.27)) (8.30) are evaluated using trigonometric recurrence relations, and the factorials are pre-computed and stored at the beginning of the run.

When computing the cell-averaged potential, we again employ a subgrid, but here the subgrid points fall on cell boundaries to improve the continuity of the result. Using subgrid points per dimension, we have

$\displaystyle x'_{i'} = x_i + (i'-0.5N')){\Delta x_i\over {N'}} ,$	$\displaystyle i' = 0\ldots {N'}$	(8.32)
$\displaystyle y'_{j'} = y_j + (j'-0.5N')){\Delta y_j\over {N'}} ,$	$\displaystyle j' = 0\ldots {N'}$	(8.33)
$\displaystyle z'_{k'} = z_k + (k'-0.5N')){\Delta z_k\over {N'}} ,$	$\displaystyle k' = 0\ldots {N'} .$	(8.34)

The cell-averaged potential in cell

is then

$\displaystyle \phi_{ijk} = {1\over {N'}^3} \sum_{i'j'k'} \phi({\bf x}'_{i'j'k'}) ,$

(8.35)

where the terms in the sum are evaluated via (8.17) up to the limiting multipole order $\ell_{\rm max}$ .

8.10.2.2 Multipole Poisson solver (improved version)

The multipole Poisson solver is based on a multipolar expansion of the source (mass for gravity, for example) distribution around a conveniently chosen center of expansion. The angular number entering this expansion is a measure of how detailed the description of the source distribution will be on an angular basis. Higher values mean higher angular resolution with respect to the center of expansion. The multipole Poisson solver is thus appropriate for spherical or nearly-spherical source distributions with isolated boundary conditions. For problems which require high spatial resolution throughout the entire domain (like, for example, galaxy collision simulations), the multipole Poisson solver is less suited, unless one is willing to go to extremely (computationally unfeasible) high values. For stellar evolution, however, the multipole Poisson solver is the method of choice.

The new implementation of the multipole Poisson solver is located in the directory

source/Grid/GridSolvers/Multipole_new.

This implementation improves upon the original implemention in many ways: i) efficient memory layout ii) elimination of numerical over- and underflow errors for large angular momenta when using astrophysical (dimensions $\approx 10^9$ ) domains iii) elimination of subroutine call overhead (1 call per cell), iv) minimization of error due to non-statistical distributions of moments near the multipolar origin. The following paragraphs explain the new approach to the multipole solver and an explanation of the above improvements. Details about the theory of the new implementation of the Poisson solver can be found in Couch et al. (2013).

The multipole Poisson solver is appropriate for spherical or nearly-spherical source distributions with isolated boundary conditions. It currently works in 1D spherical, 2D spherical, 2D cylindrical, 3D Cartesian and 3D cylindrical. Symmetries can be specified for the 2D spherical and 2D cylindrical cases (a horizontal symmetry plane along the radial axis) and the 3D Cartesian case (assumed axisymmetric property). Because of the radial symmetry in the 1D spherical case, only the monopole term ( $\ell = 0$ ) contributes, while for the 3D Cartesian axisymmetric, the 2D cylindrical and 2D spherical cases only the moments need to be used (the other $m\neq 0$ moments effectively cancel out).

The multipole algorithm consists of the following steps. First, the center of the multipolar expansion ${\bf x}_{\rm cen}$ is determined via density-squared weighted integration over position:

$\displaystyle {\bf x}_{\rm cen} = {\int {\bf x}\rho^2({\bf x}) d{\bf x} \over \int \rho^2({\bf x}) d{\bf x}}.$

(8.36)

We will take ${\bf x}_{\rm cen}$ as our origin. In integral form, Poisson's equation (8.9) becomes

$\displaystyle \phi({\bf x}) = -{\alpha\over 4\pi}\int {\rho({\bf x}')\over \vert{\bf x} - {\bf x}'\vert} d{\bf x}'.$

(8.37)

The inverse radial distance part can be expanded in terms of Legendre polynomials

$\displaystyle {1\over \vert{\bf x} - {\bf x}'\vert} = \sum_{\ell=0}^\infty {x_<^\ell\over x_>^{\ell+1}}P_\ell (\cos\gamma),$

(8.38)

where

(

) indicate the smaller (larger) of the magnitudes and $\gamma$ denotes the angle between ${\bf x}$ and ${\bf x}'$ . Note, that this definition includes those cases where both magnitudes are equal. The expansion is always convergent if $\cos\gamma <1$ . Expansion of the Legendre polynomials in terms of spherical harmonics gives

$\displaystyle P_\ell (\cos\gamma) = {4\pi\over 2\ell+1}\sum_{m=-\ell}^{+\ell} Y_{\ell m}^*(\theta',\phi') Y_{\ell m}(\theta,\phi),$

(8.39)

where $\theta,\phi$ and $\theta',\phi'$ are the spherical angular components of ${\bf x}$ and ${\bf x}'$ about ${\bf x}_{\rm cen}$ . Defining now the regular $R_{\ell m}$ and irregular $I_{\ell m}$ solid harmonic functions

$\displaystyle R_{\ell m}(x_<)$	$\displaystyle =$	$\displaystyle \sqrt{{4\pi\over {2\ell+1}}}x_<^\ell Y_{\ell m}(\theta,\phi)$	(8.40)
$\displaystyle I_{\ell m}(x_>)$	$\displaystyle =$	$\displaystyle \sqrt{{4\pi\over {2\ell+1}}}{Y_{\ell m}(\theta,\phi)\over x_>^{\ell+1}},$	(8.41)

we can rewrite Eq.(8.37) in the form

$\displaystyle \phi({\bf x}) = -{\alpha\over 4\pi} \int \sum_{\ell m}R_{\ell m}(x_<)I_{\ell m}^*(x_>)\rho({\bf x}') d{\bf x}',$

(8.42)

where the summation sign is a shorthand notation for the double sum over all the allowed $\ell$ and

values. In FLASH both the source and the potential are assumed to be cell-averaged quantities discretized on a block-structured mesh with varying cell size. The integration must hence be replaced by a summation over all leaf cells

$\displaystyle \phi(q) = -{\alpha\over 4\pi} \sum_{q'} \sum_{\ell m}R_{\ell m}(q_<)I_{\ell m}^*(q_>)m(q'),$

(8.43)

where

denotes the cell's mass. Note, that the symbol

stands for cell index as well as its defining distance position from the expansion center in the computational domain. This discrete Poisson equation is incorrect. It contains the divergent

term on the rhs. The

contribution to the potential corresponds to the cell self potential $\phi_{Self}(q)$ and is divergent in our case because all the cell's mass is assumed to be concentrated at the cell's center. The value of this divergent term can easily be calculated from Eq.(8.38) by setting $\cos\gamma = 1$ :

$\displaystyle \phi_{Self}(q)$

$\displaystyle =$

$\displaystyle m(q){L+1\over x_q},$

(8.44)

where

is the cell's mass,

the highest angular number considered in the expansion and

the radial distance of the cell center from the expansion center. To avoid this divergence problem, we evaluate the potentials on each face of the cell and form the average of all cell face potentials to get the cell center potential. Eq.(8.43) will thus be replaced by

$\displaystyle \phi(q) = {1\over n_F} \sum_{F} \phi({\bf x}_F)$

(8.45)

and

$\displaystyle \phi({\bf x}_F) = -{\alpha\over 4\pi} \sum_{q'} \sum_{\ell m}R_{\ell m}([q',x_F]_<)I_{\ell m}^*([q',x_F]_>)m(q'),$

(8.46)

where ${\bf x}_F$ is the cell face radial distance from the expansion center and

denotes the larger of the magnitudes between the cell center radial distance

and the cell face radial distance

. Splitting the summation over cells in two parts

$\displaystyle \phi({\bf x}_F) = -{\alpha\over 4\pi}\left\{ \sum_{q'\leq x_F} \s... ...} \sum_{\ell m}R_{\ell m}({\bf x}_F)\left[I_{\ell m}^*(q')m(q')\right]\right\},$

(8.47)

and defining the two moments

$\displaystyle M^R_{\ell m}({\bf x}_F)$	$\displaystyle =$	$\displaystyle \sum_{q'\leq x_F} R_{\ell m}(q')m(q')$	(8.48)
$\displaystyle M^I_{\ell m}({\bf x}_F)$	$\displaystyle =$	$\displaystyle \sum_{q'>x_F} I_{\ell m}(q')m(q'),$	(8.49)

we obtain

$\displaystyle \phi({\bf x}_F) = -{\alpha\over 4\pi}\left[ \sum_{\ell m}M^R_{\el... ...\bf x}_F) + \sum_{\ell m}M^{I*}_{\ell m}({\bf x}_F)R_{\ell m}({\bf x}_F)\right]$

(8.50)

and using vector notation

$\displaystyle \phi({\bf x}_F) = -{\alpha\over 4\pi}\left[ {\bf M}^R({\bf x}_F)\... ...{\bf I}^*({\bf x}_F) + {\bf M}^{I*}({\bf x}_F)\cdot {\bf R}({\bf x}_F) \right].$

(8.51)

We now change from complex to real formulation. We state this for the regular solid harmonic functions, the same reasoning being applied to the irregular solid harmonic functions and all their derived moments. The regular solid harmonic functions can be split into a real and imaginary part

$\displaystyle R_{\ell m} = R_{\ell m}^c + i R_{\ell m}^s.$

(8.52)

The labels 'c' and 's' are shorthand notations for 'cosine' and 'sine', reflecting the nature of the azimuthal function of the corresponding real spherical harmonics. When inserting (8.52) into (8.51) all cosine and sine mixed terms of the scalar products cancel out. Also, due to the symmetry relations

$\displaystyle R_{\ell,-m}^c$	$\displaystyle =$	$\displaystyle (-1)^m R_{\ell m}^c$	(8.53)
$\displaystyle R_{\ell,-m}^s$	$\displaystyle =$	$\displaystyle -(-1)^m R_{\ell m}^s$	(8.54)

we can restrict ourselves to the following polar angle number ranges

$\displaystyle c$	$\displaystyle :$	$\displaystyle \ell\geq 0\;\;,\;\;\ell\geq m \geq 0$	(8.55)
$\displaystyle s$	$\displaystyle :$	$\displaystyle \ell\geq 1\;\;,\;\;\ell\geq m \geq 1.$	(8.56)

The real formulation of (8.51) becomes then

$\displaystyle \phi({\bf x}_F) = -{\alpha\over 4\pi}\left\{ \left[\begin{array}{... ...ay}{l} {\bf R}^c({\bf x}_F) {\bf R}^s({\bf x}_F) \end{array}\right]\right\},$

(8.57)

which, when compared to (8.51), shows, that all vectors now contain a cosine and a sine section. The ${\bf\Delta}$ matrix is a diagonal matrix whose elements are equal to 2 for $m\neq 0$ and 1 otherwise, i.e.:

$\displaystyle {\bf\Delta} = diag(2-\delta_{m0}).$

(8.58)

The recursion relations for calculating the solid harmonic vectors are

$\displaystyle R_{00}^c$	$\displaystyle =$	$\displaystyle 1$	(8.59)
$\displaystyle R_{\ell\ell}^c$	$\displaystyle =$	$\displaystyle - {xR_{\ell-1,\ell-1}^c-yR_{\ell-1,\ell-1}^s\over 2\ell}$	(8.60)
$\displaystyle R_{\ell\ell}^s$	$\displaystyle =$	$\displaystyle - {yR_{\ell-1,\ell-1}^c+xR_{\ell-1,\ell-1}^s\over 2\ell}$	(8.61)
$\displaystyle R_{\ell m}^{c/s}$	$\displaystyle =$	$\displaystyle {(2\ell - 1)zR_{\ell-1,m}^{c/s}-r^2R_{\ell-2,m}^{c/s} \over (\ell + m)(\ell - m)},\;\;\;0\leq m <\ell$	(8.62)

and

$\displaystyle I_{00}^c$	$\displaystyle =$	$\displaystyle {1\over r}$	(8.63)
$\displaystyle I_{\ell\ell}^c$	$\displaystyle =$	$\displaystyle - (2\ell-1){xI_{\ell-1,\ell-1}^c-yI_{\ell-1,\ell-1}^s\over r^2}$	(8.64)
$\displaystyle I_{\ell\ell}^s$	$\displaystyle =$	$\displaystyle - (2\ell-1){yI_{\ell-1,\ell-1}^c+xI_{\ell-1,\ell-1}^s\over r^2}$	(8.65)
$\displaystyle I_{\ell m}^{c/s}$	$\displaystyle =$	$\displaystyle {(2\ell - 1)zI_{\ell-1,m}^{c/s}-\left[(\ell-1)^2-m^2\right] I_{\ell-2,m}^{c/s}\over r^2},\;\;\;0\leq m <\ell$	(8.66)

in which

are the cartesian location coordinates of the cell face and

. For geometries depending on polar angles one must first calculate the corresponding cartesian coordinates for each cell before applying the recursions. In FLASH, the order of the two cosine and sine components for each solid harmonic vector is such that $\ell$ precedes

. This allows buildup of the vectors with maximum number of unit strides. The same applies of course for the assembly of the moments. For 2D cylindrical and 2D spherical geometries only the

parts of both recursions are needed, involving only the cartesian

coordinate and

. Symmetry along the radial axes of these 2D geometries inflicts only the sign change $z\rightarrow -z$ , resulting in the symmetry relations $R_{\ell 0}^c\rightarrow R_{\ell 0}^c$ for even $\ell$ and $R_{\ell 0}^c\rightarrow -R_{\ell 0}^c$ for odd $\ell$ , the same holding for the irregular solid harmonic vector components. Thus symmetry in 2D can effectively be treated by halving the domain size and multiplying each even $\ell$ moments by a factor of 2 while setting the odd $\ell$ moments equal to 0. For 3D cartesian geometries introduction of symmetry is far more complicated since all

components need to be taken into account. It is not sufficient to simply reduce the domain to the appropriate size and multiply the moments by some factor, but rather one would have to specify the exact symmetry operations intended (generators of the symmetry group

or one of its subgroups) in terms of their effects on the

cartesian basis. The resulting complications in calculating symmetry adapted moments outweighs the computational gain that can be obtained from it. Options for 3D symmetry are thus no longer available in the improved FLASH multipole solver. The 'octant' symmetry option from the old multipole solver, using only the monopole $\ell = 0$ term, was too restrictive in its applicability (exact only for up to angular momenta $\ell =3$ due to cancellation of the solid harmonic vector components).

From the above recursion relations (8.59-8.66), the solid harmonic vector components are functions of monomials, where $i+j+k=\ell$ for the ${\bf R}$ and (formally) $i+j+k=-(\ell+1)$ for the ${\bf I}$ . For large astrophysical coordinates and large $\ell$ values this leads to potential computational over- and underflow. To get independent of the size of both the coordinates and $\ell$ we introduce a damping factor for the coordinates for each solid harmonic type before entering the recursions. will be chosen such that for the highest specified $\ell=L$ we will have approximately a value close to 1 for both solid harmonic components:

$\displaystyle R^{c/s}_{Lm}$	$\displaystyle \approx$	$\displaystyle 1$	(8.67)
$\displaystyle I^{c/s}_{Lm}$	$\displaystyle \approx$	$\displaystyle 1.$	(8.68)

This ensures proper handling of size at the solid harmonic function evaluation level and one does not have to rely on size cancellations at a later stage when evaluating the potential via Eq.(8.57). We next state the evaluation of the damping factor

. Due to the complicated nature of the recursions, the first step is to find solid harmonic components which have a simple structure. To do this, consider a cell face with

and $z\neq 0$ . Then

, $\vert z\vert=r$ and only the

components are different from zero. An explicit form can be stated for the absolute values of these components in terms of

$\displaystyle \vert R_{\ell 0}\vert$	$\displaystyle =$	$\displaystyle {r^\ell\over \ell!}$	(8.69)
$\displaystyle \vert I_{\ell 0}\vert$	$\displaystyle =$	$\displaystyle {\ell!\over r^{\ell+1}}.$	(8.70)

Since $r=\sqrt{x^2+y^2+z^2}$ , damping of the coordinates with

results in a damped radial cell face distance

. Inserting this result into (8.69) and (8.70) and imposing conditions (8.67) and (8.68) results in

$\displaystyle D_R = {1\over r}\sqrt[L]{L!}$	$\displaystyle \approx$	$\displaystyle {1\over r}{L\over e}\sqrt[2L]{2\pi L}$	(8.71)
$\displaystyle D_I = {1\over r}\sqrt[L+1]{L!}$	$\displaystyle \approx$	$\displaystyle {1\over r}{L\over e}\sqrt[2L+2]{{2\pi e^2\over L}},$	(8.72)

where the approximate forms are obtained by using Stirling's factorial approximation formula for large

. In FLASH only the approximate forms are computed for

and

to avoid having to deal with factorials of large numbers.

From the moment defining equations (8.48) and (8.49) we see, that the moments are sums over subsets of cell center solid harmonic vectors multiplied by the corresponding cell mass. From Eq.(8.57) it follows that for highest accuracy, the moments should be calculated and stored for each possible cell face. For high refinement levels and/or 3D simulations this would result in an unmanageable request for computer memory. Several cell face positions have to be bundled into radial bins defined by lower and upper radial bounds. Once a cell center solid harmonic vector pair ${\bf R}(q)$ and ${\bf I}(q)$ for a particular cell has been calculated, its radial bin location $q\rightarrow Q$ is determined and its contribution is added to the radial bin moments ${\bf M}^R(Q)$ and ${\bf M}^I(Q)$ . The computational definition of the radial bin moments is

$\displaystyle {\bf M}^R(Q)$	$\displaystyle =$	$\displaystyle \sum_{q\leq Q}{\bf R}(q)m(q)$	(8.73)
$\displaystyle {\bf M}^I(Q)$	$\displaystyle =$	$\displaystyle \sum_{q\geq Q}{\bf I}(q)m(q),$	(8.74)

where $q\leq Q$ means including all cells belonging to

and all radial bins with lower radial boundaries than

. The two basic operations of the multipole solver are thus: i) assembly of the radial bin moments and ii) formation of the scalar products via Eq.(8.57) to obtain the potentials. The memory declaration of the moment array should reflect the way the individual moment components are addressed and the most efficient layout puts the angular momentum indices in rows and the radial bin indices in columns.

How do we extract moments ${\bf M}^R({\bf x})$ and ${\bf M}^I({\bf x})$ at any particular position ${\bf x}$ inside the domain (and, in particular, at the cell face positions ${\bf x}_F$ ), which are ultimately needed for the potential evaluation at that location? Assume that the location ${\bf x}$ corresponds to a particular radial bin ${\bf x}\rightarrow Q$ . Consider the three consecutive radial bins , and , together with their calculated moments:

$\begin{displaymath}\begin{array}{r\vert r\vert r} {\bf M}^R(Q-1) & {\bf M}^R(Q) ... ...) {\bf M}^I(Q-1) & {\bf M}^I(Q) & {\bf M}^I(Q+1) \end{array}\end{displaymath}$

(8.75)

Let us concentrate on the

bin, whose lower and upper radial limits are shown as solid vertical lines. The radial distance corresponding to ${\bf x}$ splits the

bin into two parts: the left fractional part, denoted $R_{frac}$ , and the right fractional part, denoted $I_{frac}$ . Since both ${\bf M}^R(Q-1)$ and ${\bf M}^I(Q+1)$ are completely contained respectively in ${\bf M}^R(Q)$ and ${\bf M}^I(Q)$ , the moments at ${\bf x}$ be approximately evaluated as:

$\displaystyle {\bf M}^R({\bf x})$	$\displaystyle =$	$\displaystyle {\bf M}^R(Q-1) + R_{frac}\left[{\bf M}^R(Q)-{\bf M}^R(Q-1)\right]$	(8.76)
$\displaystyle {\bf M}^I({\bf x})$	$\displaystyle =$	$\displaystyle {\bf M}^I(Q+1) + I_{frac}\left[{\bf M}^I(Q)-{\bf M}^I(Q+1)\right],$	(8.77)

The extraction of the moments via (8.76) and (8.77) is of course an approximation that relies on the statistically dense distribution of the individual cell center moments inside each radial bin. For bins which are reasonably far away from the expansion center this statistical approximation is valid but for those close to the expansion center the statistical distribution does not hold and calculating the moments via the above scheme introduces a large statistical error. The way out of this problem is to move from a statistical radial bin description around the expansion center to a more discrete one, by constructing very narrow isolated radial bins. The code is thus forced to analyze the detailed structure of the geometrical domain grid surrounding the expansion center and to establish the inner radial zone of discrete distributed radial bins. The statistical radial bins are then referred to as belonging to the outer radial zone(s).

While the structure of the inner radial zone is fixed due to the underlying geometrical grid, the size of each radial bin in the outer radial zones has to be specified by the user. There is at the moment no automatic derivation of the optimum (accuracy vs memory cost) bin size for the outer zones. There are two types of radial bin sizes defined for the FLASH multipole solver: i) exponentially and/or ii) logarthmically growing:

exponential bin size upper radial limit	$\displaystyle =$	$\displaystyle s\cdot \Delta r \cdot Q^t$	(8.78)
logarithmic bin size upper radial limit	$\displaystyle =$	$\displaystyle s\cdot \Delta r \cdot {e^{tQ}-1\over e^t-1}.$	(8.79)

In these definitions, $\Delta r$ is a small distance 'atomic' (basic unit) radial measure, defined as half the geometric mean of appropriate cell dimensions at highest refinement level,

is a scalar factor to optionally increase or decrase the atomic unit radial measure and $Q = 1,2,\ldots$ is a local bin index counter for each outer zone. The atomic radial distance $\Delta r$ is calculated for each individual domain geometry as follows:

$\displaystyle \begin{tabular}{c\vert c} Domain Geometry & $\Delta r$ \\ \hlin... ... & ${1\over 2}\Delta R$ \\ 1D spherical & ${1\over 2}\Delta R$ \end{tabular},$

(8.80)

where $\Delta x,\Delta y,\Delta z$ are the usual cartesian cell dimensions and $\Delta R$ is the radial cell dimension. Note, that since $\Delta r$ measures a basic radial unit along the radial distance from the expansion center (which, for approximate spherical problems, is located close to the domain's geometrical origin), only those cell dimensions for calculating each $\Delta r$ are taken, which are directly related to radial distances from the geometrical domain origin. For 3D cylindrical domain geometries for example, only the radial cylindrical and z-coordinate cell dimensions determine the 3D radial distance from the 3D cylindrical domain origin. The angular coordinate is not needed. Likewise for spherical domains only the radial cell coordinate is of importance. Definitions (8.78) and (8.79) define the upper limit of the radial bins. Hence in order to obtain the true bin size for the

-th bin one has to subtract its upper radial limit from the corresponding one of the

-th bin:

$\displaystyle \mbox{$Q$-th exponential bin size}$	$\displaystyle =$	$\displaystyle s\cdot \Delta r \cdot \left[Q^t-(Q-1)^t\right]$	(8.81)
$\displaystyle \mbox{$Q$-th logarithmic bin size}$	$\displaystyle =$	$\displaystyle s\cdot \Delta r \cdot e^{t(Q-1)}.$	(8.82)

In principle the user can specify as many outer zone types as he/she likes, each having its own exponential or logarithmic parameter pair $\{s,t\}$ .

Multithreading of the code is currently enabled in two parts: 1) during moment evaluation and 2) during potential evaluation. The threading in the moment evaluation section is achieved by running multiple threads over separate, non-conflicting radial bin sections. Moment evaluation is thus organized as a single loop over all relevant radial bins on each processor. Threading over the potential evaluation is done over blocks, as these will address different non-conflicting areas of the solution vector.

The improved multipole solver was extensively tested and several runs have been performed using large domains ( $>10^{10}$ ) and extremely high angular numbers up to for a variety of domain geometries. Currently, the following geometries can be handled: 3D cartesian, 3D cylindrical, 2D cylindrical, 2D spherical and 1D spherical. The structure of the code is such that addition of new geometries, should they ever be needed by some applications, can be done rapidly.

8.10.2.3 Multipole Poisson solver unit test (MacLaurin spheroid)

The first unit test for the multipole Poisson solver is based on the MacLaurin spheroid analytical gravitational solution given in section 35.3.4. The unit test sets up a spheroid with uniform unit density and determines both the analytical and numerical gravitational fields. The absolute relative error based on the analytical solution is formed for each cell and the maximum of all the absolute errors is compared to a predefined error tolerance value for a particular uniform refinement level. The multipole unit test runs in 2D cylindrical, 2D spherical, 3D cartesian and 3D cylindrical geometries and threaded multipole unit tests are also available.

8.10.2.4 Tree Poisson Solver

The tree solver is based on the Barnes & Hut (1986, Nature, 324, 446) tree code for calculation of gravity forces in N-body simulations. However, it is more general and it includes some more modern features described for instance in Salmon & Warren (1994, J. Comp. Phys, 136, 155), Springel (2005, MNRAS, 364, 1105) and other works. It builds a global octal tree over the whole computational domain, communicates its part to all processors and uses it for calculations of various physical problems provided as separate units in source/physics directory. The global tree is an extension of the AMR mesh octal tree down to individual cells. The communication of the tree is implemented so that only parts of the tree that are needed for the calculation of the potential on a given processor are sent to it. The calculation of physical problems is done by walking the tree for each grid cell (hereafter point-of-calculation) and evaluating whether the tree node should be used for calculation or whether its children should be open.

The tree solver is connected to physical units by several wrapper subroutines that are called at specific places of the tree build and tree walk, and that call corresponding subroutines of physical units. In this way, physical units can include arbitrary quantity into the tree, and then, use it to calculate some other physical quantity by integrating contributions of all tree nodes during the tree walk. In this version, the only working unit implementation is physics/Gravity/GravityMain/Poisson/BHTree, which calculates the gravitational potential.

The tree solver algorithm consists of four parts. The first one, communication of block properties, is called only if the AMR grid changes. The other three, building of the tree, communication of the tree and calculation of the potential, are called in each time-step.

Communication of block properties. In recent version, each processor needs to know some basic information about all blocks in the simulation. It includes arrays: nodetype, lrefine and child. These arrays are distributed from each processor to all the other processors. They can occupy a substantial amount of memory on large number of processors (memory required for statically allocated arrays of the tree solver can be calculated by a script tree_mem_use.py).

Building the tree. The global tree is constructed from bottom and the process consists of three steps. In the first one, the so-called block-tree is constructed in each leaf block on a given processor. The block-trees are 1-dimensional dynamically allocated arrays (see Figure 8.12) and pointers to them are stored in array gr_bhTreeArray. In the second step, top nodes of block-trees (corresponding to whole blocks) are distributed to all processors and stored in array gr_bhTreeParentTree. In the last step, higher nodes of the parent tree are calculated by each processor and stored in the gr_bhTreeParentTree array. At the end, each processor holds information about the global tree down to the level of leaf blocks. During the whole process of tree building, 5 subroutines providing the interface to physical units are called: gr_bhFillBotNode, gr_bhAccBotNode, gr_bhAccNode, gr_bhNormalizeNode and gr_bhPostprocNode (see their auto-documentation and source code for details). Each of them calls a corresponding subroutine of all physical units with a name where the first two letters 'gr' are replaced with the name of the unit (e.g. Gravity_bhFillBotNode).

**Figure 8.12:** Example of a block-tree in case of `nxb`=`nyb`=`nzb`=8 and in case physical units do not store any further information to tree nodes (masses and mass centre positions are included by the tree solver itself).

Communication of the tree. Most of the tree data is contained on the bottom levels in individual block-trees. In order to save memory and communication time, only parts of block-trees that are needed on a given processor are sent to it. The procedure consists of three steps. In the first one, a level down to which each block-tree has to be sent to each processor is determined. For a given block-tree, it is done by evaluating the criterion for the node acceptance (traditionally called multipole acceptance criterion, shortly MAC) for all blocks on a remote processor, searching for the maximum level down to which the evaluated node will be needed on a given remote processor. In the second step, information about the block-tree levels which are going to be communicated is sent to all processors. This information is needed for allocation of arrays in which block-trees are stored on remote processors. In the third step, the block-tree arrays are allocated, all block-trees for a given processor are packed into a single message and the messages are communicated.

The MAC is implemented in subroutine gr_bhMAC which includes only a simple geometrical MAC used also by Barnes & Hut. The node is accepted for calculation if

$\displaystyle \frac{S_\mathrm{node}}{D} < \mathtt{gr\_bhTreeLimAngle} ,$

(8.83)

where $S_\mathrm{node}$ is the node size (defined as the largest edge of the corresponding cuboid) and

is the distance between the node and the point-of-calculation. Additionally, gr_bhMAC checks that the point-of-calculation is not located within the node itself enlarged by factor gr_bhTreeSafeBox. On the top of that, gr_bhMAC calls MACs of physical units and the node is accepted only if all criteria are fulfilled.

Tree walk. The tree is traversed from the top to the bottom, evaluating MAC of each node and in case it is not fulfilled, continuing the tree walk with its children. If the node's MAC is fulfilled, the node is accepted for the calculation and subroutine gr_bhBotNodeContrib or gr_bhNodeContrib is called, depending on whether it is a bottom-most node (i.e. a single grid cell) or higher node, respectively. These subroutines only call the corresponding subroutines of physical units (e.g., Gravity_bhNodeContrib). This is the most CPU-intensive part of the tree solver, it usually takes more than $90\%$ of the total tree solver time. It is completely parallel and it does not include any communication (apart from sending some statistics to the main processor at the end).

The tree solver includes several implementations of the tree walk. The default algorithm is the Barnes-Hut like tree walk in which the whole tree is traversed from the top down to nodes fulfilling MAC for each cell separately. This algorithm is used in case the runtime parameter gr_bhUnifiedTreeWalk is true (default). If it this parameter is set to false, another algorithm is used in which instead of walking the whole tree for each cell individually, MAC is at first evaluated for whole mesh block (interacting with some node). If the node is accepted and if the node is a parent node (i.e. corresponding to whole mesh block), the node is accepted for all cells of the block and the contribution of the node is added to them. However, the node contribution is calculated separately for each cell, because the distance between the node and individual cells differs. The third tree walk algorithm is an implementation of the so called SumSquare MAC described by Salmon & Warren (1994). The tree is traversed using the priority queue, taking contribution of the most important nodes first. This algorithm provides much better error control, however, the implementation in this code version is highly experimental.

The tree solver supports isolated and periodic boundary conditions that can be set independently in each direction. In the latter case, when a node is considered for MAC evaluation and eventually calculation by calling Gravity_bhNodeContrib, periodic copies of the node are checked, and the minimum distance among the node periodic copies is taken in account. This allows for instance to calculate gravitational potential with periodic boundary conditions using the Ewald method (see description of the Gravity unit).

8.10.2.5 Tree Poisson solver unit test

The unit test for the tree gravity solver calculates the gravitational potential of the Bonnor-Ebert sphere (Bonnor, W. B., 1956, MNRAS, 116, 351) and compares it to the analytical potential. The density distribution and the analytical potential are calculated by the python script bes-generator.py. The simulation setup only reads the file with radial profiles of these quantities and sets it on the grid. It also normalizes the analytical potential (adds a constant to it) so that the minimum values of the analytical and numerical potential are the same. The error of the gravitational potential calculated by the tree code is stored in the field array PERR (written into the PlotFile). The maximum absolute and relative errors are written into the log file.

8.10.2.6 Multigrid Poisson solver

This section of the User's Guide is taken from a paper by Paul Ricker, “A Direct Multigrid Poisson Solver for Oct-Tree Adaptive Meshes” (2008). Dr. Ricker wrote an original version of this multigrid algorithm for FLASH2. The Flash Center adapted it to FLASH3.

Structured adaptive mesh refinement provides some challenges for the implementation of effective, parallel multigrid methods. In the case of patch-based meshes, Huang & Greengard (2000) presents an algorithm which works by using the coarse-grid solution to impose boundary values on the fine grid. Discontinuities in the solution caused by jumps in refinement are resolved through iterative calculation of the residual and subsequent correction. While this is not a multigrid method in the standard sense, it still provides significant convergence acceleration.

The adaptation of this method to the FLASH grid structure (Ricker, 2008) requires a few modifications. The original formulation required that there be shared points between the coarse and fine patches. Contrast this with finite-volume, nested-cell, cell-averaged grids as used in FLASH(Figure 8.13). This is overcome by the exchange of guardcells from coarse to fine using monotonic interpolation (Sec:gridinterp) and external boundary extrapolation for the calculation of the residual.

**Figure 8.13:** Contrast between jumps of refinement in meshes used in the original paper (left) and the oct-tree adapted method (right).

Another difference between the method of (Ricker 2008) and Huang & Greengard is that an oct-tree undoubtedly has neighboring blocks of the same refinement, while a patch-based mesh would not. This problem is solved through uniform prolongation of boundaries from coarse-to-fine, with simple relaxation done to eliminate the slight error introduced between adjacent cells.

One final change between the two methods is that the original computes new sources at the boundary between corrections, while the propagation here is done through nested solves on various levels.

The entire algorithm requires that the PARAMESH grid be reset such that all blocks at refinement above some level $\ell$ are set as temporarily nonexistent. This is required so that guardcell filling can occur at only that level, neglecting blocks at a higher level of refinement. This requires some global communication by PARAMESH.

The method requires three basic operators over the solution $\phi$ on the grid: taking the residual, restricting a fine-level function to coarser-level blocks, and prolonging values from the coarse level to the faces of fine level blocks in order to impose boundary values for the fine mesh problems.

The residual is calculated such that:

$\displaystyle R({\bf x}) \equiv 4\pi G \rho({\bf x}) - \nabla^2 \tilde\phi({\bf x}) .$

(8.84)

This is accomplished through the application of the finite difference laplacian, defined on level $\ell$ with length-scales $\Delta x_\ell$ , $\Delta y_\ell$ and $\Delta z_\ell$ .

$\displaystyle % multiple lines needed in equation {\cal D}_\ell\tilde\phi^{b\ell}_{ijk}$	$\displaystyle \equiv$	$\displaystyle {1\over\Delta x_\ell^2}\left(\tilde\phi^{b\ell}_{i+1,jk} - 2\tild... ...ll}_{i,j+1,k} - 2\tilde\phi^{b\ell}_{ijk} + \tilde\phi^{b\ell}_{i,j-1,k}\right)$	(8.85)
		$\displaystyle + {1\over\Delta z_\ell^2}\left(\tilde\phi^{b\ell}_{ij,k+1} - 2\tilde\phi^{b\ell}_{ijk} + \tilde\phi^{b\ell}_{ij,k-1}\right) .$	(8.86)

The restriction operator ${\cal R}_\ell$ for block interior zones is:

$\displaystyle ({\cal R}_\ell\tilde\phi)^{{\cal P}(c),\ell}_{ijk} \equiv {1 \over {2^d}} \sum_{i'j'k'} \tilde\phi^{c,\ell+1}_{i'j'k'} ,$

(8.87)

where the indices

refer to the zones in block

that lie within zone

of block ${\cal P}(c)$ . We apply the restriction operator throughout the interiors of blocks, but its opposite, the prolongation operator ${\cal I}_\ell$ , need only be defined on the edges of blocks, because it is only used to set boundary values for the direct single-block Poisson solver:

$\displaystyle ({\cal I}_\ell\tilde\phi)^{c,\ell+1}_{i'j'k'} \equiv \sum_{p,q,r = -2}^2 \alpha_{i'j'k'pqr}\tilde\phi^{{\cal P}(c),\ell}_{i+p,j+q,k+r}$

(8.88)

When needed, boundary zone values are set as for the difference operator. We use conservative quartic interpolation to set edge values, then solve with homogeneous Dirichlet boundary conditions after using second-order boundary-value elimination. The coefficients $\alpha$ determine the interpolation scheme. For the

face in 3D,

$\displaystyle \alpha_{1/2,j'k'pqr}$	$\displaystyle =$	$\displaystyle \beta_p \gamma_{j'q} \gamma_{k'r}$	(8.89)
$\displaystyle (\beta_p)$	$\displaystyle =$	$\displaystyle \left( -{1\over{12}}, {7\over{12}}, {7\over{12}}, -{1\over{12}}, 0 \right)$
$\displaystyle (\gamma_{j'q})$	$\displaystyle =$	$\displaystyle \left\{\begin{array}{ll}\displaystyle \left( -{3\over{128}}, {{11... ...11}\over{64}}, -{3\over{128}}\right) & j' {\rm even} \end{array} \right.$

Interpolation coefficients are defined analogously for the other faces. Note that we use half-integer zone indices to refer to averages over the faces of a zone; integer zone indices refer to zone averages.

8.10.2.7 The direct solver

In the case of problems with Dirichlet boundary conditions, a -dimensional fast sine transform is used. The transform-space Green's Function for this is:

$\displaystyle G^\ell_{ijk} = -16\pi G\left[ {1\over{\Delta x_\ell^2}}\sin^2\lef... ...) + {1\over{\Delta z_\ell^2}}\sin^2\left({k\pi\over 2n_z}\right)\right]^{-1} .$

(8.90)

However, to be able to use the block solver in a general fashion, we must be able to impose arbitrary boundary conditions per-block. In the case of nonhomogenous Dirichlet boundary values, boundary value elimination may be used to generalize the solver. For instance, at the boundary:

$\displaystyle \rho_{1jk} \rightarrow \rho_{1jk} - {2\over\Delta x_\ell^2}\phi(x_{1/2},y_j,z_k) .$

(8.91)

For periodic problems only the coarsest block must be handled differently; block adjacency for finer levels is handled naturally. The periodic solver uses a real-to-complex FFT with the Green's function:

$\displaystyle G^\ell_{ijk} = \left\{\begin{array}{ll}\displaystyle {-16\pi G\le... ...rm or } k \ne 1 \displaystyle 0\hfill i = j = k = 1 \end{array} \right.$

(8.92)

This solve requires that the source be zero-averaged; otherwise the solution is non-unique. Therefore the source average is subtracted from all blocks. In order to decimate error across same-refinement-level boundaries, Gauss-Seidel relaxations to the outer two layers of zones in each block are done after applying the direct solver to all blocks on a level. With all these components outlined, the overall solve may be described by the following algorithm:

Restrict the source function $4\pi G\rho$ to all levels. Subtract the global average for the periodic case.
Interpolation step: For $\ell$ from 1 to $\ell_{\rm max}$ ,
1. Reset the grid so that $\ell$ is the maximum refinement level
2. Solve ${\cal D}_\ell \tilde\phi^{b\ell}_{ijk} = 4\pi G\rho^{b\ell}_{ijk}$ for all blocks on level $\ell$ .
3. Compute the residual $R^{b\ell}_{ijk} = 4\pi G\rho^{b\ell}_{ijk} - {\cal D}_\ell \tilde\phi^{b\ell}_{ijk}$
4. For each block on level $\ell$ that has children, prolong face values for $\tilde\phi^{b\ell}_{ijk}$ onto each child block.
Residual propagation step: Restrict the residual $R^{b\ell}_{ijk}$ to all levels.
Correction step: Compute the discrete norm of the residual over all leaf-node blocks and divide it by the discrete norm of the source over the same blocks. If the result is greater than a preset threshold value, proceed with a correction step: for each level $\ell$ from 1 to $\ell_{\rm max}$ ,
1. Reset the grid so that $\ell$ is the maximum refinement level
2. Solve ${\cal D}_\ell C^{b\ell}_{ijk} = R^{b\ell}_{ijk}$ for all blocks on level $\ell$ .
3. Overwrite $R^{b\ell}_{ijk}$ with the new residual $R^{b\ell}_{ijk} - {\cal D}_\ell C^{b\ell}_{ijk}$ for all blocks on level $\ell$ .
4. Correct the solution on all leaf-node blocks on level $\ell$ : $\tilde\phi^{b\ell}_{ijk} \rightarrow \tilde\phi^{b\ell}_{ijk} + C^{b\ell}_{ijk}$ .
5. For each block on level $\ell$ that has children, interpolate face boundary values of $C^{b\ell}_{ijk}$ for each child.
If a correction step was performed, return to the residual propagation step.

The above procedure requires storage for $\tilde\phi$ , , , and $\rho$ on each block, for a total storage requirement of values per block. Global communication is required in computing the tolerance-based stopping criterion.

8.10.2.8 A Hybrid Poisson Solver: Interfacing PFFT with Multigrid

We can improve the performance of the Multigrid solver in Section 8.10.2.6 by replacing single block FFTs with a parallel FFT at a specified coarse level, where, the coarse level is any level which is fully refined, i.e. containing blocks that completely cover the computational domain. Currently, we automatically select the maximum refinement level that is fully refined.

There is load imbalance in the Multigrid solver because each processor performs single block FFTs on the blocks it owns. At the coarse levels there are relatively few blocks compared to available processors which means many processors are effectively idle during the coarse level solves. The introduction of PFFT, and creation of a hybrid solver, eliminates some of the coarse level solves.

The performance characteristics of the hybrid solver are described in “Optimization of multigrid based elliptic solver for large scale simulations in the FLASH code” (2012) which is available online at http://onlinelibrary.wiley.com/doi/10.1002/cpe.2821/pdf. Performance results are obtained using the PFFT_PoissonFD unit test.

8.10.3 Using the Poisson solvers

The GridSolvers subunit solves the Poisson equation ((8.9)). Two different elliptic solvers are supplied with FLASH: a multipole solver, suitable for approximately spherical source distributions, and a multigrid solver, which can be used with general source distributions. The multipole solver accepts only isolated boundary conditions, whereas the multigrid solver supports Dirichlet, given-value, Neumann, periodic, and isolated boundary conditions. Boundary conditions for the Poisson solver are specified using an argument to the Grid_solvePoisson routine which can be set from different runtime parameters depending on the physical context in which the Poisson equation is being solved. The Grid_solvePoisson routine is the primary entry point to the Poisson solver module and has the following interface

call Grid_solvePoisson (iSoln, iSrc, bcTypes(6), bcValues(2,6), poisfact) ,

where iSoln and iSrc are the integer-valued indices of the solution and source (density) variables, respectively. bcTypes(6) is an integer array specifying the type of boundary conditions to employ on each of the (up to) 6 sides of the domain. Index 1 corresponds to the -x side of the domain, 2 to +x, 3 to -y, 4 to +y, 5 to -z, and 6 to +z. The following values are accepted in the array

bcTypes	Type of boundary condition
0	Isolated boundaries
1	Periodic boundaries
2	Dirichlet boundaries
3	Neumann boundaries
4	Given-value boundaries

Not all boundary types are supported by all solvers. In this release, bcValues(2,6) is not used and can be filled arbitrarily. Given-value boundaries are treated as Dirichlet boundaries with the boundary values subtracted from the outermost interior cells of the source; for this case the solution variable should contain the boundary values in its first layer of boundary cells on input to Grid_solvePoisson. It should be noted that if PARAMESH is used, the values must be set for all levels. Finally, poisfact is real-valued and indicates the value of $\alpha$ multiplying the source function in ((8.9)).

When solutions found using the Poisson solvers are to be differenced (e.g., in computing the gravitational acceleration), it is strongly recommended that for AMR meshes, quadratic (or better) spatial interpolation at fine-coarse boundaries is chosen. (For PARAMESH, this is automatically the case by default, and is handled correctly for Cartesian as well as the supported curvilinear geometries. But note that the default interpolation implementation may be changed at configuration time with the '-gridinterpolation=...' setup option; and with the default implementation, the interpolation order may be lowered with the interpol_order runtime parameter.) If the order of the gridinterpolation of the mesh is not of at least the same order as the differencing scheme used in places like Gravity_accelOneRow, unphysical forces will be produced at refinement boundaries. Also, using constant or linear grid interpolation may cause the multigrid solver to fail to converge.

8.10.3.1 Multipole (original version)

The poisson/multipole sub-module takes two runtime parameters, listed in Table 8.3. Note that storage and CPU costs scale roughly as the square of mpole_lmax, so it is best to use this module only for nearly spherical matter distributions.

**Table 8.3:** Runtime parameters used with `poisson/multipole`.
Variable	Type	Default	Description
`mpole_lmax`	integer	10	Maximum multipole moment
`quadrant`	logical	`.false.`	Use symmetry to solve a single quadrant in 2D axisymmetric cylindrical () coordinates, instead of a half domain.

8.10.3.2 Multipole (improved version)

To include the new multipole solver in a simulation, the best option is to use the shortcut +newMpole at setup command line, effectively replacing the following setup options :

-with-unit=Grid/GridSolvers/Multipole_new
-with-unit=physics/Gravity/GravityMain/Poisson/Multipole
-without-unit=Grid/GridSolvers/Multipole

The improved multipole solver currently accepts only two setup parameters, either one switching on multithreading:

threadBlockList: enables multithreaded compilation and execution.
threadWithinBlock: enables multithreaded compilation and execution.

The names of these two setup parameters are missleading, since there is only one universal threading strategy used. The use of these two setup parameters is a temporary solution and will be replaced in near future by only one setup parameter.

The improved multipole solver takes several runtime parameters, whose functions are explained in detail below, together with comments about expected time and memory scaling.

mpole_Lmax: The maximum angular moment to be used for the multipole Poisson solver. Depending on the domain geometry, the memory and time scaling factors due to this variable alone are: i) 3D cartesian, 3D cylindrical $\rightarrow (L+1)(L+1)$ , ii) 3D cartesian axisymmetric, 2D cylindrical, 2D spherical $\rightarrow (L+1)$ , iii) 1D spherical $\rightarrow 1$ . Assuming no memory limitations, the multipole solver is numerically stable for very large values. Runs up to for 3D cartesian domains have been performed. For 2D geometries, was the maximum tested.
mpole_2DSymmetryPlane: In 2D coordinates, this runtime parameter enables the user to specify a plane of symmetry along the radial part of the domain coordinates. In effect, this allows a reduction of the computational domain size by one half. The code internally computes the multipole moments as if the other symmetric part is present, i.e. no memory or execution time savings can be achieved by this runtime parameter.
mpole_3DAxisymmetry: Forces rotational invariance around the main () axis in 3D cartesian domains. The assumed rotational invariance in the plane effectively cancels all $m\neq 0$ multipole moments and one can restrict the calculation to the multipole moments only. The time and memory savings compared to a asymmetric 3D cartesian run is thus about a factor of . For 3D cylindrical domains, rotational invariance in the plane is equivalent of setting up the corresponding 2D cylindrical domain, hence this runtime parameter is not honored for 3D cylindrical domains, and the user is informed about the 3D to 2D cylindrical domain reduction possibility.
mpole_DumpMoments: This parameter is meant mainly for debugging purposes. It prints the entire moment array for each radial bin for each time step. This option should be used with care and for small problems only. The output is printed to a text file named 'basenm_dumpMoments.txt', where is the base name given for the output files.
mpole_PrintRadialInfo: This parameter enables showing all detailed radial bin information at each time step. This option is especially useful for optimizing the radial bin sizes. The output is written to the text file 'basenm_printRadialInfo.txt'.
mpole_IgnoreInnerZone: Controls switching on/off the radial inner zone. If it is set .true., the inner zone will not be recognized and all inner zone radii will be treated statistically. This parameter is meant only for performing some error analysis. For production runs it should always be at its default value of false. Otherwise errors will be introduced in calculating the moments near the expansion center.
mpole_InnerZoneSize: The size defining the discrete inner zone. The size is given in terms of the inner zone smallest (atomic) radius, which is determined at each time step by analyzing the domain grid structure around the multipolar origin (expansion center). Only very rarely will this value ever have to be changed. The default setting is very conservative and only under unusual circumstances (ex: highly nonuniform grid around the expansion center) this might be necessary. This value needs to be an integer, as it is used by the code to define dimensions of certain arrays. Note, that by giving this runtime parameter a large integer value (> 1000 for domain refinement levels up to 5) one can enforce the code to use only non-statistical radial bins.
mpole_InnerZoneResolution: Defines the inner zone radial bin size for the inner zone in terms of the inner zone smallest (atomic) radius. Two inner zone radii will be considered different, if they are more than this resolution value apart. A very tiny number (for example $10^{-8}$ ) will result in a complete separation of all inner zone radii into separate radial bins. The default value of should never be surpassed, and any attempt to do so will stop the program with the appropriate information to the user. Likewise with a meaningless resolution value of 0.
mpole_MaxRadialZones: The maximum number of outer radial zones to be used. In contrast to the inner radial zone, the outer radial zones are much more important for the user. Their layout defines the performance of the multipole solver both in cpu time spent and accuracy of the potential obtained at each cell. The default value of 1 outer radial zone at maximum refinement level leads to high accuracy, but at the same time can consume quite a bit of memory, especially for full 3D runs. In these cases the user can specify several outer radial zones each having their own radial bin size determination rule.
mpole_ZoneRadiusFraction_n: The fraction of the maximum domain radius defining the n-th outer zone maximum radial value. The total number of fractions given must match the maximum number of outer radial zones specified and the fractions must be in increasing order and less than unity as we move from the 1st outer zone upwards. The last outer zone must always have a fraction of exactly 1. If not, the code will enforce it.
mpole_ZoneType_n: String value containing the outer radial zone type for the n-th outer zone. If set to 'exponential', the radial equation $r(Q) = s \cdot \Delta r \cdot Q^t$ , defining the upper bound radius of the Q-th radial bin in the n-th outer zone, is used. If set to 'logarithmic', the radial equation $r(Q) = s \cdot \Delta r \cdot (e^{Qt}-1)/(e^t-1)$ is used. In these equations is a local radial bin counting index for each outer zone and are parameters defining size and growth of the outer zone radial bins (see below).
mpole_ZoneScalar_n: The scalar value in the n-th outer radial zone equation $r(Q) = s \cdot \Delta r \cdot Q^t$ or $r(Q) = s \cdot \Delta r \cdot (e^{Qt}-1)/(e^t-1)$ . The scalar is needed to be able to increase (or decrease) the size of the first radial bin with respect to the default smallest outer zone radius $\Delta r$ .
mpole_ZoneExponent_n: The exponent value in the n-th outer radial zone equations $r(Q) = s \cdot \Delta r \cdot Q^t$ or $r(Q) = s \cdot \Delta r \cdot (e^{Qt}-1)/(e^t-1)$ . The exponent controls the growth (shrinkage) in size of each radial bin with increasing bin index. For the first equation, growing will occur for , shrinking for and same size for . For the logarithmic equation, growing will occur for , shrinking for , but the same size option will not work because the denominator becomes undefined. The same size option must hence be treated using the exponential outer zone type choice.

Runtime parameter types, defaults and options:

Parameter	Type	Default	Options
`mpole_Lmax`	integer	0
`mpole_2DSymmetryPlane`	logical	false	true
`mpole_3DAxisymmetry`	logical	false	true
`mpole_DumpMoments`	logical	false	true
`mpole_PrintRadialInfo`	logical	false	true
`mpole_IgnoreInnerZone`	logical	false	true
`mpole_InnerZoneSize`	integer	16
`mpole_InnerZoneResolution`	real	0.1	less than and
`mpole_MaxRadialZones`	integer	1
`mpole_ZoneRadiusFraction_n`	real	1.0	less than and
`mpole_ZoneType_n`	string	”exponential”	”logarithmic”
`mpole_ZoneScalar_n`	real	1.0
`mpole_ZoneExponent_n`	real	1.0	(exponential)
	real	-	any $\neq 0$ (logarithmic)

8.10.3.3 Tree Poisson solver

The tree gravity solver can be included by setup or a Config file by requesting

physics/Gravity/GravityMain/Poisson/BHTree

The current implementation works only in 3D Cartesian coordinates, and blocks have to be logically cubic (i.e., nxb=nyb=nzb). Physical dimensions of blocks can be arbitrary, however, some multipole acceptance criteria can provide inaccurate error estimates with non-cubic blocks. The computational domain can have arbitrary dimensions, and there can be more blocks with lrefine=1 (i.e., nblockx, nblocky and nblockz can have different values).

Runtime parameters gr_bhPhysMACTW and gr_bhPhysMACComm control whether MACs of physical units are used in tree walk and communication, respectively. If one of them (or both) is set .false., only purely geometric MAC is used for a corresponding part of the tree solver. It is not allowed to set gr_bhPhysMACTW = .false. and gr_bhPhysMACComm = .true..

Runtime parameter gr_bhTreeLimAngle allows to set the limit opening angle for the purely geometrical MAC. Another condition controlling the acceptance of the node for the calculations is that the point-of-calculation must lie out of the box obtained by increasing the considered node by factor gr_bhTreeSafeBox.

Parameter gr_bhUseUnifiedTW controls whether the Barnes-Hut like tree walk algorithm is used (.true.) or whether an alternative algorithm is used which checks the MAC only once for whole block for interactions with parent blocks (.false.; see 8.10.2.4 for more details). The latter one is $10 - 20\%$ faster, however, it may lead to higher errors at block boundaries, in particular if the gravity modules calculates the potential which is subsequently differentiated to obtain gravitational acceleration. The tree walk algorithm base on the priority queue is used if grv_bhMAC is set to "SumSquare".

Variable Type Default Description

gr_bhPhysMACTW logical .false. whether physical MAC should be used in tree walk

gr_bhPhysMACComm logical .false. whether physical MAC should be used in communication

gr_bhTreeLimAngle real limiting opening angle

gr_bhTreeSafeBox real relative size of restricted volume around node where the

point-of-calculation is not allowed to be located

gr_bhUseUnifiedTW logical .true. whether Barnes-Hut like tree walk algorithm should be used

gr_bhTWMaxQueueSize integer .true. maximum length of the priority queue

8.10.3.4 Multigrid

The Grid/GridSolvers/Multigrid module is appropriate for general source distributions. It solves Poisson's equation for 1, 2, and 3 dimensional problems with Cartesian geometries. It only supports the PARAMESH Grid with one block at the coarsest level. For any other mesh configuration it is advisable to use the hybrid solver, which switches to a uniform grid exact solve when the specified level of coarsening has been achieved. In most use cases for FLASH, the multigrid solver will be used to solve for Gravity (see: chp:Gravity). It may be included by setup or Config by including:

physics/Gravity/GravityMain/Poisson/Multigrid

The multigrid solver may also be included stand-alone using:

Grid/GridSolvers/Multigrid

In which case the interface is as described above. The supported boundary conditions for the module are periodic, Dirichlet, given-value, and isolated. Due to the nature of the FFT block solver, the same type of boundary condition must be used in all directions. Therefore, only the value of bcTypes(1) will be considered in the call to Grid_solvePoisson.

The multigrid solver requires the use of two internally-used grid variables: isls and icor. These are used to store the calculated residual and solved-for correction, respectively. If it is used as a Gravity solver with isolated boundary conditions, then two additional grid variables, imgm and imgp, are used to store the image mass and image potential.

**Table 8.4:** Runtime parameters used with `Grid/GridSolvers/Multigrid`.
Variable	Type	Default	Description
`mg_MaxResidualNorm`	real	$1\times 10^{-6}$	Maximum ratio of the norm of the residual to that of the right-hand side
`mg_maxCorrections`	integer	100	Maximum number of iterations to take
`mg_printNorm`	real	`.true.`	Print the norm ratio per-iteration
`mpole_lmax`	integer	4	The number of multipole moments used in the isolated case

8.10.3.5 Hybrid (Multigrid with PFFT)

The hybrid solver can be used in place of the Multigrid solver for problems with

all-periodic
2 periodic and 1 Neumann
1 periodic and 2 Neumann

boundary conditions, if the default PFFT solver variant (called DirectSolver) is used. To use the hybrid solver in this way, add Grid/GridSolvers/Multigrid/PfftTopLevelSolve to your setup line or request the solver in your Simulation Config file (see e.g. unitTest/PFFT_PoissonFD). The following setup lines create a unit test that uses first the hybrid solver and then the standard Multigrid solver

./setup unitTest/PFFT_PoissonFD -auto -3d -parfile=flash_pm_3d.par -maxblocks=800 +noio


./setup unitTest/PFFT_PoissonFD -auto -3d -parfile=flash_pm_3d.par -maxblocks=800 +noio
-without-unit=Grid/GridSolvers/Multigrid/PfftTopLevelSolve
-with-unit=Grid/GridSolvers/Multigrid

It is also possible to select a different PFFT solver variant. In that case, different combinations of boundary conditions for the Poisson problem may be supported. The HomBcTrigSolver variant supports the largest set of combinations of boundary conditions. Use the PfftSolver setup variable to choose a variant. Thus, appending PfftSolver=HomBcTrigSolver to the setup chooses the HomBcTrigSolver variant. When using the hybrid solver with the PFFT variants HomBcTrigSolver or SimplePeriodicSolver, the runtime parameter gr_pfftDiffOpDiscretize should be set to 1.

The Multigrid runtime parameters given in the previous section also apply.

8.10.4 HYPRE

As a part of implicit time advancement we end up with a system of equations that needs to be solved at every time step. In FLASH4 the HYPRE linear algebra package is used to solve these systems of equations. Therefore it is necessary to install Hypre if this capability of FLASH is to be used.

Grid_advanceDiffusion is the API function which solves the system of equations. This API is provided by both the split and unsplit solvers. The unsplit solver uses HYPRE to solve the system of equations and split solver does a direct inverse using Thomas algorithm. Note that the split solver relies heavily on PFFT infrastructure for data exchange and a significant portion of work in split Grid_advanceDiffusion involves PFFT routines. In the unsplit solver the data exchange is implicitly done within HYPRE and is hidden.

The steps in unsplit Grid_advanceDiffusion are as follows:

Setup HYPRE grid object
Exchange Factor B
Set initial guess
Compute HYPRE Matrix M such that B = MX
Compute RHS Vector B
Compute matrix A
Solve system AX = B
Update solution (in FLASH4)

Mapping UG grid to HYPRE matrix is trivial, however mapping PARAMESH grid to a HYPRE matrix can be quite complicated. The process is simplified using the grid interfaces provided by HYPRE.

Struct Grid interface
SStruct Grid interface
IJ System interface

The choice of an interface is tightly coupled to the underlying grid on which the problem is being solved. We have chosen the SSTRUCT interface as it is the most compatible with the block structured AMR mesh in FLASH4. Two terms commonly used in HYPRE terminology are part and box. We define these terms in equivalent FLASH4 terminology. A HYPRE box object maps directly to a leaf block in FLASH4. The block is then defined by it's extents. In FLASH4 this information can be computed easily using a combination of Grid_getBlkCornerID and Grid_getBlkIndexLimits.All leaf blocks at a given refinement level form a HYPRE part. So number of parts in a typical FLASH4 grid would be give by,

nparts = leaf_block(lrefine_max) - leaf_block(lrefine_min) + 1

So, if a grid is fully refined or UG, nparts = 1. However, there could still be more then one box object.

Setting up the HYPRE grid object is one of the most important step of the solution process. We use the SSTRUCT interface provided in HYPRE to setup the grid object. Since the HYPRE Grid object is mapped directly with FLASH4 grid, whenever the FLASH4 grid changes the HYPRE grid object needs to be updated. Consequentlywith AMR the HYPRE grid setup might happen multiple times.

Setting up a HYPRE grid object is a two step process,

Creating stenciled relationships.
Creating Graph relationships.

Stenciled relationships typically exist between leaf blocks at same refinement level (intra part) and graph relationships exist between leaf blocks at different refinement levels (inter part). The fine-coarse boundary is handled in such a way that fluxes are conserved at the interface (see Chp:diffuse for details). UG does not require any graph relationships.

Whether a block needs a graph relationship depends on the refinement level of it's neighbor. While this information is not directly available in PARAMESH, it is possible to determine whether the block neighbor is coarser or finer. Combining this information with the constraint of at best a factor of two jump in refinement at block boundaries, it is possible to compute the part number of a neighbor block, which in turn determines whether we need a graph. Creating a graph involves creating a link between all the cells on the block boundary.

Once the grid object is created, the matrix and vector objects are built on the grid object. The diffusion solve needs uninterpolated data from neighbor blocks even when there is a fine-coarse boundary, therefore it cannot rely upon the guardcell fill process. A two step process is used to handle this situation,

Since HYPRE has access to X(at n, i.e., initial guess), the RHS vector B can be computed as MX where M is a modified Matrix.
Similarly the value of Factor B can be shared across the fine-coarse boundary by using
Grid_conserveFluxes,the fluxes need to be set in a intuitive to way to achieve the desired effect.

With the computation of Vector B (RHS), the system can be solved using HYPRE and UNK can be updated.

8.10.4.1 HYPRE Solvers

In FLASH4 we use the HYPRE PARCSR storage format as this exposes the maximum number of iterative solvers.

**Table 8.5:** Solvers, Preconditioners combinations used with `Grid/GridSolvers/HYPRE`.
Solver	Preconditioner
PCG	AMG, ILU(0)
BICGSTAB	AMG, ILU(0)
GMRES	AMG, ILU(0)
AMG	-
SPLIT	-

Parallel runs: One issue that has been observed is that there is a difference in the results produced by HYPRE using one or more processors. This would most likely be due to use of CG (krylov subspace methods), which involves an MPI_SUM over the dot product of the residue. We have found this error to be dependent on the type of problem. One way to get across this problem is to use direct solvers in HYPRE like SPLIT. However we have noticed that direct solvers tend to be slow. THe released code has an option to use the SPLIT solver, but this solver has not been extensively tested and was used only for internal debugging purposes and the usage of the HYPRE SPLIT solver in FLASH4 is purely experimental.

Customizing solvers: HYPRE exposes a lot more parameters to tweak the solvers and preconditioners mentioned above. We have only used those which are applicable to general diffusion problems. Although in general these settings might be good enough it is by no means complete and might not be applicable to all class of problems. Use of additional HYPRE parameters might require direct manipulation of FLASH4 code.

Symmetric Positive Definite (SPD) Matrix: PCG has been noticed to have convergence issues which might be related to (not necessarily),

A non-SPD matrix generated due to round of errors (only).
Use of BoomerAMG as PC (refer to HYPRE manual).

The default settings use PCG as the solver with AMG as preconditioner. The following parameters can be tweaked at run time,

**Table 8.6:** Runtime parameters used with `Grid/GridSolvers/HYPRE`.
Variable	Type	Default	Description
`gr_hyprePCType`	string	`"hypre_amg"`	Algebraic Multigrid as Preconditioner
`gr_hypreMaxIter`	integer	10000	Maximum number of iterations
`gr_hypreRelTol`	real	$1\times 10^{-8}$	Maximum ratio of the norm of the residual to that of the initial residue
`gr_hypreSolverType`	string	`"hypre_pcg"`	Type of linear solver, Preconditioned Conjugate gradient
`gr_hyprePrintSolveInfo`	boolean	FALSE	enables/disables some basic solver information (for e.g number of iterations)
`gr_hypreInfoLevel`	integer	1	Verbosity level of solver diagnostics (refer HYPRE manual).
`gr_hypreFloor`	real	$1\times 10^{-12}$	Used to floor the end solution.
`gr_hypreUseFloor`	boolean	TRUE	whether to apply gr_hypreFloor to floor results from HYPRE.

8.10.4.2 Unified Hypre Solver

Often the systems of equations that we want to solve with HYPRE are some variation of parabolic PDES that can be expressed genericaly as

$\displaystyle A_i \frac{\partial U_i}{\partial t} = C_{ij}U^*_j + \partial_k \l... ...ijkl} \partial_l U^*_j + E_{ijk} U^*_j \right) + D_i + F_{ijk} \partial_k U^*_j$

(8.93)

Typically these derivatives need to be discretized in the same manner independent of the system of equations with the only differences being in the factors and transport coefficients used in the tensors , $B_{ijkl}$ , $C_{ij}$ , , $E_{ijk}$ & $F_{ijk}$ . FLASH4.8 introduces a unified HYPRE solver that will construct all the necessary HYPRE objects, solve the system equations given the tensors in Eq. 8.93, and update the Grid variables. This is supported for uniform and paramesh grids.

Previous versions of the HYPRE solvers, as described above, shared a single instance of all the HYPRE objects. The unified solver described in this section maintains separate objects for each solver that are maintained in source/Grid/GridSolvers/HYPRE/Unified/gr_uhypreData. New solvers are registered at setup in source/Grid/GridSolvers/HYPRE/Unified/Config. This will also generate runtime parameters unique to each solver as described in Table 8.6 replacing gr_hypre $\rightarrow$ gr_hypreSolver_.

**Table 8.7:** Setup generated runtime parameters for each unified HYPRE solver created. For a given solver, `Solver`, the actual runtime parameters will be prepended with `gr_uhypreSolver_`.
Parameter	Type	Default	Description
`SolverType`	STRING	"HYPRE_PCG"	Runtime parameters used with `Grid/GridSolvers/HYPRE`.
`PCType`	STRING	"HYPRE_AMG"	Algebraic Multigrid as Preconditioner
`RelTol`	REAL	1.0e-8	Maximum ratio of the norm of the residual to that of the initial residue
`AbsTol`	REAL	0.0
`SolverAutoAbsTolFact`	REAL	0.0
`MinIter`	INTEGER	0	Minimum number of iterations for HYPRE solve phase
`MaxIter`	INTEGER	500	Maximum number of iterations for HYPRE solve phase
`PrintSolveInfo`	BOOLEAN	FALSE	When `.true.` prints solve info from HYPRE
`InfoLevel`	INTEGER	1	level of output when using `PrintSolveInfo`
`FloorType`	INTEGER	0
`Floor`	REAL	1.0e-12	Used to floor the end solution.
`Use2Norm`	BOOLEAN	FALSE
`CfTol`	REAL	0.0
`RecomputeResidual`	BOOLEAN	FALSE
`RecomputeResidualP`	INTEGER	-1
`RelChange`	BOOLEAN	FALSE
`SlopeLimType`	STRING	"HYPRESL_MC"

Systems are solved with calls to Grid_advanceImplicit by passing an instance of parabolicProblemParms from the Grid_parabolicProblem module, whose parameters are detailed in Table 8.8. This type contains all the information about the system of equations described by Eq. 8.93. The various tensors are described by multidimensional arrays, such as FactorE whose elements will also be multiplied by a variable in unk through iFactorE, which should match the index in iFactorBMap.

**Table 8.8:** `parabolicProblemParms` type from `Grid_parabolicProblem`
Variable	Type	Description
iFactorBMap	integer(`ncoeffs`)	Unique indices in solnvec for transport coefficients
iFactorA	integer(`nvars`)	Stores indices into solnvec for LHS term (weighting factors)
FactorA	real(`nvars`)	Constant coefficient for LHS term (weighting factor)
iFactorB	integer(`nvars,nvars,NDIM,NDIM`)	Indices in iFactorBMap for transport coefficients
FactorB	real(`nvars,nvars,NDIM,NDIM`)	Constant coefficients for transport/gradient terms
iFactorC	integer(`nvars,nvars`)	Indices in solnvec with inter variable coupling coefficients
FactorC	real(`nvars,nvars`)	Constant coefficients for coupling coefficients
iFactorD	integer(`nvars`)	Indices in solnvec to map to
FactorD	real(`nvars`)	Constant coefficient for each variable to multiply constant source
iFactorE	integer(`nvars,nvars,NDIM`)	Indices in solnvec to map to
FactorE	real(`nvars,nvars,NDIM`)	Constant coefficients for transport/gradient terms involving geometric source terms
iFactorF	integer(`nvars,nvars,NDIM`)	Indices in solnvec to map to
FactorF	real(`nvars,nvars,NDIM`)	Constant coefficients for transport/gradient terms involving geometric source terms
iFactorG	integer(`nvars,nvars`)	Indices in solnvec with inter variable coupling coefficients
FactorG	real(`nvars,nvars`)	Constant coefficients for coupling coefficients
iVar	integer(`nvars`)	Stores indices into solnvec for variables to solve
iFluxMap	integer(`nvars`)	Stores unique indices into solnvec for flux limiting variable
iFlux	integer(`nvars`)	Stores indices in iFluxMap for each variable's maximum flux
Flux	real(`nvars`)	Stores indices in iFluxMap for each variable's maximum flux
mode	integer(`nvars`)	Flux limiter mode to use for each variable
bcTypes	integer(`NDIM,nvars,2`)	Boundary condition flag
bcValues	real(`NDIM,nvars,2`)	Value or slope at boundary - when using Neumann or Dirichlet BC's
theta	real	Implicitness of solve
iSolution	integer	Handle so we can keep track of what equation we are solving. Should match a system `PPDEFINE` from `GridSolvers/HYPRE/Unified/Config`

Eq. 8.93 is evolved with a finite-volume discretization. We integrate Eq. 8.93 over a control volume $V_\alpha$ , centered on the point $\mathbf{x}_\alpha$ , to get

$\displaystyle V_\alpha A_i \left . \frac{\partial U_i}{\partial t} \right \vert... ...a D_i + \left [ V_\alpha F_{ijk} \partial_k U^*_j \right ]_{\mathbf{x}_\alpha},$

(8.94)

where the divergence theorem has been used to turn the -factor term to be an integral over the face of our control volume and any remaining volume integrals are approximated by the value at the center of control volume $\mathbf{x}_\alpha$ . The -factors require derivatives integrated over the faces of mesh and the -factors need derivatives evaluated at the cell-centers.

There are two classes of derivatives that could be need at a face, either derivatives normal to the face or those that are transverse to the face. In both cases finite-difference weights on a stencil of neighboring points are used, but differ in the approximation used for the surface integration. In the case of normal derivatives the stencil is constructed from cells that share the face of interest, and the derivative is evaluated at the center of the face, allowing a mid-point quadrature for the surface integral on that face.

$\displaystyle \int_{A_{i+1/2}}B\partial_l U dA_{l} = B_{i+1/2}\partial_l U ({\mathbf{x}_{i+1/2}}) A_{i+1/2} + \mathcal{O}(\Delta^2)$

(8.95)

A stencil that could approximate derivatives transverse to a cell-face would require neighboring cells that share a common edge, introducing a new coupling between diagnoal elements, for example $U_{ij} \leftarrow U_{i+1,j+1}$ . If the face-integrals are approximated with a mid-point quadrature, like in Eq. 8.97, the diagonal coupling, through the face coefficients $B_{i+1/2,j}$ & $B_{i+1,j+1/2}$ is no longer symmetric if the transport coefficients, , vary in space

$\begin{displaymath}\begin{array}{cc} U_{ij} \xleftarrow{B_{i+1/2,j}} U_{i+1,j+1} & U_{i+1,j+1} \xleftarrow{B_{i+1,j+1/2}} U_{ij}. \end{array}\end{displaymath}$

(8.96)

The resulting systems of equations are numerically unstable, particularly when $B_{i+1,j+1/2} » B_{i+1/2,j}$ . In order to symmetrize the discretization for these new edge-couplings we approximate the face-integrals with a two point trapezoid rule in the transverse direction

$\displaystyle \int_{A_{i+1/2}}B\partial_l U dA_{l} \approx \frac{1}{2}\left ( B... ...U_{i+1/2,j+1/2} + B_{i+1/2,j-1/2}\partial_l U _{i+1/2,j-1/2}. \right )A_{i+1/2}$

(8.97)

This way the edge-coupling is always symmetric through the edge-centered transport coefficients $B_{i+1/2,j+1/2}$ .

For both edge-centered and face-centered derivatives the Unified HYPRE solver computes stencil weights for all possible refinement configurations at fine-coarse boundaries and handles the creation of graph objects between parts.

8.10.4.3 Curvilinear Coordinates

For scalar systems of equations the finite-volume formulism in Eq. 8.94 naturally incorporates any grid geometry through the volume and area factors. However, for vector systems of equations, such as velocity with viscosity or magnetic field with resistive MHD, the discretization needs to account for geometric scale factors introduced through the vector calculus operators. The Unified HYPRE solver currently only handles "cylindrical" geometries, where it is often convenient to express the generic system of parabolic equations as

$\displaystyle A_i \frac{\partial U_i}{\partial t} = C_{ij}U^*_j + D_i + \frac{1... ...U^*_j \right) + \frac{1}{r}F_{ijk} \partial_k U^*_j + \frac{1}{r^2}G_{ij} U^*_j$

(8.98)

Just as before the second derivative & -factor terms can be integrated with the divergence theorem to be evaluated on the cell-faces, but the volume & -factor terms will include a $\frac{1}{r}$ geometric term.

8.10 GridSolvers

8.10.1 Pfft

8.10.1.1 Using Pfft

8.10.1.2 Pfft data movement subroutines

8.10.1.3 Unit Test

8.10.2 Poisson equation

8.10.2.1 Multipole Poisson solver (original version)

8.10.2.2 Multipole Poisson solver (improved version)

8.10.2.3 Multipole Poisson solver unit test (MacLaurin spheroid)

8.10.2.4 Tree Poisson Solver

8.10.2.5 Tree Poisson solver unit test

8.10.2.6 Multigrid Poisson solver

8.10.2.7 The direct solver

8.10.2.8 A Hybrid Poisson Solver: Interfacing PFFT with Multigrid

8.10.3 Using the Poisson solvers

8.10.3.1 Multipole (original version)

8.10.3.2 Multipole (improved version)

8.10.3.3 Tree Poisson solver

8.10.3.4 Multigrid

8.10.3.5 Hybrid (Multigrid with PFFT)

8.10.4 HYPRE

8.10.4.1 HYPRE Solvers

8.10.4.2 Unified Hypre Solver

8.10.4.3 Curvilinear Coordinates

8.10.1.1 Using `Pfft`

8.10.1.2 `Pfft` data movement subroutines