step-67/doc/intro.dox

<br>

<i>
This program was contributed by Martin Kronbichler. Many ideas presented here
are the result of common code development with Niklas Fehn, Katharina Kormann,
Peter Munch, and Svenja Schoeder.

This work was partly supported by the German Research Foundation (DFG) through
the project "High-order discontinuous Galerkin for the exa-scale" (ExaDG)
within the priority program "Software for Exascale Computing" (SPPEXA).
</i>

<a name="Intro"></a>
<h1>Introduction</h1>

This tutorial program solves the Euler equations of fluid dynamics using an
explicit time integrator with the matrix-free framework applied to a
high-order discontinuous Galerkin discretization in space. For details about
the Euler system and an alternative implicit approach, we also refer to the
step-33 tutorial program. You might also want to look at step-69 for
an alternative approach to solving these equations.


<h3>The Euler equations</h3>

The Euler equations are a conservation law, describing the motion of a
compressible inviscid gas,
@f[
\frac{\partial \mathbf{w}}{\partial t} + \nabla \cdot \mathbf{F}(\mathbf{w}) =
\mathbf{G}(\mathbf w),
@f]
where the $d+2$ components of the solution vector are $\mathbf{w}=(\rho, \rho
u_1,\ldots,\rho u_d,E)^{\mathrm T}$. Here, $\rho$ denotes the fluid density,
${\mathbf u}=(u_1,\ldots, u_d)^\mathrm T$ the fluid velocity, and $E$ the
energy density of the gas. The velocity is not directly solved for, but rather
the variable $\rho \mathbf{u}$, the linear momentum (since this is the
conserved quantity).

The Euler flux function, a $(d+2)\times d$ matrix, is defined as
@f[
  \mathbf F(\mathbf w)
  =
  \begin{pmatrix}
  \rho \mathbf{u}\\
  \rho \mathbf{u} \otimes \mathbf{u} + \mathbb{I}p\\
  (E+p)\mathbf{u}
  \end{pmatrix}
@f]
with $\mathbb{I}$ the $d\times d$ identity matrix and $\otimes$ the outer
product; its components denote the mass, momentum, and energy fluxes, respectively.
The right hand side forcing is given by
@f[
  \mathbf G(\mathbf w)
  =
  \begin{pmatrix}
  0\\
  \rho\mathbf{g}\\
  \rho \mathbf{u} \cdot \mathbf{g}
  \end{pmatrix},
@f]
where the vector $\mathbf g$ denotes the direction and magnitude of
gravity. It could, however, also denote any other external force per unit mass
that is acting on the fluid. (Think, for example, of the electrostatic
forces exerted by an external electric field on charged particles.)

The three blocks of equations, the second involving $d$ components, describe
the conservation of mass, momentum, and energy. The pressure is not a
solution variable but needs to be expressed through a "closure relationship"
by the other variables; we here choose the relationship appropriate
for a gas with molecules composed of two atoms, which at moderate
temperatures is given by $p=(\gamma - 1) \left(E-\frac 12 \rho
\mathbf{u}\cdot \mathbf{u}\right)$ with the constant $\gamma = 1.4$.


<h3>High-order discontinuous Galerkin discretization</h3>

For spatial discretization, we use a high-order discontinuous Galerkin (DG)
discretization, using a solution expansion of the form
@f[
\mathbf{w}_h(\mathbf{x}, t) =
\sum_{j=1}^{n_\mathbf{dofs}} \boldsymbol{\varphi}_j(\mathbf{x}) {w}_j(t).
@f]
Here, $\boldsymbol{\varphi}_j$ denotes the $j$th basis function, written
in vector form with separate shape functions for the different components and
letting $w_j(t)$ go through the density, momentum, and energy variables,
respectively. In this form, the space dependence is contained in the shape
functions and the time dependence in the unknown coefficients $w_j$. As
opposed to the continuous finite element method where some shape functions
span across element boundaries, the shape functions are local to a single
element in DG methods, with a discontinuity from one element to the next. The
connection of the solution from one cell to its neighbors is instead
imposed by the numerical fluxes
specified below. This allows for some additional flexibility, for example to
introduce directionality in the numerical method by, e.g., upwinding.

DG methods are popular methods for solving problems of transport character
because they combine low dispersion errors with controllable dissipation on
barely resolved scales. This makes them particularly attractive for simulation
in the field of fluid dynamics where a wide range of active scales needs to be
represented and inadequately resolved features are prone to disturb the
important well-resolved features. Furthermore, high-order DG methods are
well-suited for modern hardware with the right implementation. At the same
time, DG methods are no silver bullet. In particular when the solution
develops discontinuities (shocks), as is typical for the Euler equations in
some flow regimes, high-order DG methods tend to oscillatory solutions, like
all high-order methods when not using flux- or slope-limiters. This is a consequence of <a
href="https://en.wikipedia.org/wiki/Godunov%27s_theorem">Godunov's theorem</a>
that states that any total variation limited (TVD) scheme that is linear (like
a basic DG discretization) can at most be first-order accurate. Put
differently, since DG methods aim for higher order accuracy, they cannot be
TVD on solutions that develop shocks. Even though some communities claim that
the numerical flux in DG methods can control dissipation, this is of limited
value unless <b>all</b> shocks in a problem align with cell boundaries. Any
shock that passes through the interior of cells will again produce oscillatory
components due to the high-order polynomials. In the finite element and DG
communities, there exist a number of different approaches to deal with shocks,
for example the introduction of artificial diffusion on troubled cells (using
a troubled-cell indicator based e.g. on a modal decomposition of the
solution), a switch to dissipative low-order finite volume methods on a
subgrid, or the addition of some limiting procedures. Given the ample
possibilities in this context, combined with the considerable implementation
effort, we here refrain from the regime of the Euler equations with pronounced
shocks, and rather concentrate on the regime of subsonic flows with wave-like
phenomena. For a method that works well with shocks (but is more expensive per
unknown), we refer to the step-69 tutorial program.

For the derivation of the DG formulation, we multiply the Euler equations with
test functions $\mathbf{v}$ and integrate over an individual cell $K$, which
gives
@f[
\left(\mathbf{v}, \frac{\partial \mathbf{w}}{\partial t}\right)_{K}
+ \left(\mathbf{v}, \nabla \cdot \mathbf{F}(\mathbf{w})\right)_{K} =
\left(\mathbf{v},\mathbf{G}(\mathbf w)\right)_{K}.
@f]

We then integrate the second term by parts, moving the divergence
from the solution slot to the test function slot, and producing an integral
over the element boundary:
@f[
\left(\mathbf{v}, \frac{\partial \mathbf{w}}{\partial t}\right)_{K}
- \left(\nabla \mathbf{v}, \mathbf{F}(\mathbf{w})\right)_{K}
+ \left<\mathbf{v}, \mathbf{n} \cdot \widehat{\mathbf{F}}(\mathbf{w})
\right>_{\partial K} =
\left(\mathbf{v},\mathbf{G}(\mathbf w)\right)_{K}.
@f]
In the surface integral, we have replaced the term $\mathbf{F}(\mathbf w)$ by
the term $\widehat{\mathbf{F}}(\mathbf w)$, the numerical flux. The role of
the numerical flux is to connect the solution on neighboring elements and
weakly impose continuity of the solution. This ensures that the global
coupling of the PDE is reflected in the discretization, despite independent
basis functions on the cells. The connectivity to the neighbor is included by
defining the numerical flux as a function $\widehat{\mathbf{F}}(\mathbf w^-,
\mathbf w^+)$ of the solution from both sides of an interior face, $\mathbf
w^-$ and $\mathbf w^+$. A basic property we require is that the numerical flux
needs to be <b>conservative</b>. That is, we want all information (i.e.,
mass, momentum, and energy) that leaves a cell over
a face to enter the neighboring cell in its entirety and vice versa. This can
be expressed as $\widehat{\mathbf{F}}(\mathbf w^-, \mathbf w^+) =
\widehat{\mathbf{F}}(\mathbf w^+, \mathbf w^-)$, meaning that the numerical
flux evaluates to the same result from either side. Combined with the fact
that the numerical flux is multiplied by the unit outer normal vector on the
face under consideration, which points in opposite direction from the two
sides, we see that the conservation is fulfilled. An alternative point of view
of the numerical flux is as a single-valued intermediate state that links the
solution weakly from both sides.

There is a large number of numerical flux functions available, also called
Riemann solvers. For the Euler equations, there exist so-called exact Riemann
solvers -- meaning that the states from both sides are combined in a way that
is consistent with the Euler equations along a discontinuity -- and
approximate Riemann solvers, which violate some physical properties and rely
on other mechanisms to render the scheme accurate overall. Approxiate Riemann
solvers have the advantage of beging cheaper to compute. Most flux functions
have their origin in the finite volume community, which are similar to DG
methods with polynomial degree 0 within the cells (called volumes). As the
volume integral of the Euler operator $\mathbf{F}$ would disappear for
constant solution and test functions, the numerical flux must fully represent
the physical operator, explaining why there has been a large body of research
in that community. For DG methods, consistency is guaranteed by higher order
polynomials within the cells, making the numerical flux less of an issue and
usually affecting only the convergence rate, e.g., whether the solution
converges as $\mathcal O(h^p)$, $\mathcal O(h^{p+1/2})$ or $\mathcal
O(h^{p+1})$ in the $L_2$ norm for polynomials of degree $p$. The numerical
flux can thus be seen as a mechanism to select more advantageous
dissipation/dispersion properties or regarding the extremal eigenvalue of the
discretized and linearized operator, which affect the maximal admissible time
step size in explicit time integrators.

In this tutorial program, we implement two variants of fluxes that can be
controlled via a switch in the program (of course, it would be easy to make
them a run time parameter controlled via an input file). The first flux is
the local Lax--Friedrichs flux
@f[
\hat{\mathbf{F}}(\mathbf{w}^-,\mathbf{w}^+) =
\frac{\mathbf{F}(\mathbf{w}^-)+\mathbf{F}(\mathbf{w}^+)}{2} +
   \frac{\lambda}{2}\left[\mathbf{w}^--\mathbf{w}^+\right]\otimes
   \mathbf{n^-}.
@f]

In the original definition of the Lax--Friedrichs flux, a factor $\lambda =
\max\left(\|\mathbf{u}^-\|+c^-, \|\mathbf{u}^+\|+c^+\right)$ is used
(corresponding to the maximal speed at which information is moving on
the two sides of the interface), stating
that the difference between the two states, $[\![\mathbf{w}]\!]$ is penalized
by the largest eigenvalue in the Euler flux, which is $\|\mathbf{u}\|+c$,
where $c=\sqrt{\gamma p / \rho}$ is the speed of sound. In the implementation
below, we modify the penalty term somewhat, given that the penalty is of
approximate nature anyway. We use
@f{align*}{
\lambda
&=
\frac{1}{2}\max\left(\sqrt{\|\mathbf{u^-}\|^2+(c^-)^2},
                     \sqrt{\|\mathbf{u}^+\|^2+(c^+)^2}\right)
\\
&=
\frac{1}{2}\sqrt{\max\left(\|\mathbf{u^-}\|^2+(c^-)^2,
                           \|\mathbf{u}^+\|^2+(c^+)^2\right)}.
@f}
The additional factor $\frac 12$ reduces the penalty strength (which results
in a reduced negative real part of the eigenvalues, and thus increases the
admissible time step size). Using the squares within the sums allows us to
reduce the number of expensive square root operations, which is 4 for the
original Lax--Friedrichs definition, to a single one.
This simplification leads to at most a factor of
2 in the reduction of the parameter $\lambda$, since $\|\mathbf{u}\|^2+c^2 \leq
\|\mathbf{u}\|^2+2 c |\mathbf{u}\| + c^2 = \left(\|\mathbf{u}\|+c\right)^2
\leq 2 \left(\|\mathbf{u}\|^2+c^2\right)$, with the last inequality following
from Young's inequality.

The second numerical flux is one proposed by Harten, Lax and van Leer, called
the HLL flux. It takes the different directions of propagation of the Euler
equations into account, depending on the speed of sound. It utilizes some
intermediate states $\bar{\mathbf{u}}$ and $\bar{c}$ to define the two
branches $s^\mathrm{p} = \max\left(0, \bar{\mathbf{u}}\cdot \mathbf{n} +
\bar{c}\right)$ and $s^\mathrm{n} = \min\left(0, \bar{\mathbf{u}}\cdot
\mathbf{n} - \bar{c}\right)$. From these branches, one then defines the flux
@f[
\hat{\mathbf{F}}(\mathbf{w}^-,\mathbf{w}^+) =
\frac{s^\mathrm{p} \mathbf{F}(\mathbf{w}^-)-s^\mathrm{n} \mathbf{F}(\mathbf{w}^+)}
                   {s^\mathrm p - s^\mathrm{n} } +
\frac{s^\mathrm{p} s^\mathrm{n}}{s^\mathrm{p}-s^\mathrm{n}}
\left[\mathbf{w}^--\mathbf{w}^+\right]\otimes \mathbf{n^-}.
@f]
Regarding the definition of the intermediate state $\bar{\mathbf{u}}$ and
$\bar{c}$, several variants have been proposed. The variant originally
proposed uses a density-averaged definition of the velocity, $\bar{\mathbf{u}}
= \frac{\sqrt{\rho^-} \mathbf{u}^- + \sqrt{\rho^+}\mathbf{u}^+}{\sqrt{\rho^-}
+ \sqrt{\rho^+}}$. Since we consider the Euler equations without shocks, we
simply use arithmetic means, $\bar{\mathbf{u}} = \frac{\mathbf{u}^- +
\mathbf{u}^+}{2}$ and $\bar{c} = \frac{c^- + c^+}{2}$, with $c^{\pm} =
\sqrt{\gamma p^{\pm} / \rho^{\pm}}$, in this tutorial program, and leave other
variants to a possible extension. We also note that the HLL flux has been
extended in the literature to the so-called HLLC flux, where C stands for the
ability to represent contact discontinuities.

At the boundaries with no neighboring state $\mathbf{w}^+$ available, it is
common practice to deduce suitable exterior values from the boundary
conditions (see the general literature on DG methods for details). In this
tutorial program, we consider three types of boundary conditions, namely
<b>inflow boundary conditions</b> where all components are prescribed,
@f[
\mathbf{w}^+ = \begin{pmatrix} \rho_\mathrm{D}(t)\\
(\rho \mathbf u)_{\mathrm D}(t) \\ E_\mathrm{D}(t)\end{pmatrix} \quad
 \text{(Dirichlet)},
@f]
<b>subsonic outflow boundaries</b>, where we do not prescribe exterior
solutions as the flow field is leaving the domain and use the interior values
instead; we still need to prescribe the energy as there is one incoming
characteristic left in the Euler flux,
@f[
\mathbf{w}^+ = \begin{pmatrix} \rho^-\\
(\rho \mathbf u)^- \\ E_\mathrm{D}(t)\end{pmatrix} \quad
 \text{(mixed Neumann/Dirichlet)},
@f]
and <b>wall boundary condition</b> which describe a no-penetration
configuration:
@f[
\mathbf{w}^+ = \begin{pmatrix} \rho^-\\
(\rho \mathbf u)^- - 2 [(\rho \mathbf u)^-\cdot \mathbf n] \mathbf{n}
 \\ E^-\end{pmatrix}.
@f]

The polynomial expansion of the solution is finally inserted to the weak form
and test functions are replaced by the basis functions. This gives a discrete
in space, continuous in time nonlinear system with a finite number of unknown
coefficient values $w_j$, $j=1,\ldots,n_\text{dofs}$. Regarding the choice of
the polynomial degree in the DG method, there is no consensus in literature as
of 2019 as to what polynomial degrees are most efficient and the decision is
problem-dependent. Higher order polynomials ensure better convergence rates
and are thus superior for moderate to high accuracy requirements for
<b>smooth</b> solutions. At the same time, the volume-to-surface ratio
of where degrees of freedom are located,
increases with higher degrees, and this makes the effect of the numerical flux
weaker, typically reducing dissipation. However, in most of the cases the
solution is not smooth, at least not compared to the resolution that can be
afforded. This is true for example in incompressible fluid dynamics,
compressible fluid dynamics, and the related topic of wave propagation. In this
pre-asymptotic regime, the error is approximately proportional to the
numerical resolution, and other factors such as dispersion errors or the
dissipative behavior become more important. Very high order methods are often
ruled out because they come with more restrictive CFL conditions measured
against the number of unknowns, and they are also not as flexible when it
comes to representing complex geometries. Therefore, polynomial degrees
between two and six are most popular in practice, see e.g. the efficiency
evaluation in @cite FehnWallKronbichler2019 and references cited therein.

<h3>Explicit time integration</h3>

To discretize in time, we slightly rearrange the weak form and sum over all
cells:
@f[
\sum_{K \in \mathcal T_h} \left(\boldsymbol{\varphi}_i,
\frac{\partial \mathbf{w}}{\partial t}\right)_{K}
=
\sum_{K\in \mathcal T_h}
\left[
\left(\nabla \boldsymbol{\varphi}_i, \mathbf{F}(\mathbf{w})\right)_{K}
-\left<\boldsymbol{\varphi}_i,
\mathbf{n} \cdot \widehat{\mathbf{F}}(\mathbf{w})\right>_{\partial K} +
\left(\boldsymbol{\varphi}_i,\mathbf{G}(\mathbf w)\right)_{K}
\right],
@f]
where $\boldsymbol{\varphi}_i$ runs through all basis functions with from 1 to
$n_\text{dofs}$.

We now denote by $\mathcal M$ the mass matrix with entries $\mathcal M_{ij} =
\sum_{K} \left(\boldsymbol{\varphi}_i,
\boldsymbol{\varphi}_j\right)_K$, and by
@f[
\mathcal L_h(t,\mathbf{w}_h) = \left[\sum_{K\in \mathcal T_h}
\left[
\left(\nabla \boldsymbol{\varphi}_i, \mathbf{F}(\mathbf{w}_h)\right)_{K}
- \left<\boldsymbol{\varphi}_i,
\mathbf{n} \cdot \widehat{\mathbf{F}}(\mathbf{w}_h)\right>_{\partial K}
+ \left(\boldsymbol{\varphi}_i,\mathbf{G}(\mathbf w_h)\right)_{K}
\right]\right]_{i=1,\ldots,n_\text{dofs}}.
@f]
the operator evaluating the right-hand side of the Euler operator, given a
function $\mathbf{w}_h$ associated with a global vector of unknowns
and the finite element in use. This function $\mathcal L_h$ is explicitly time-dependent as the
numerical flux evaluated at the boundary will involve time-dependent data
$\rho_\mathrm{D}$, $(\rho \mathbf{u})_\mathrm{D}$, and $E_\mathbf{D}$ on some
parts of the boundary, depending on the assignment of boundary
conditions. With this notation, we can write the discrete in space, continuous
in time system compactly as
@f[
\mathcal M \frac{\partial \mathbf{w}_h}{\partial t} =
\mathcal L_h(t, \mathbf{w}_h),
@f]
where we have taken the liberty to also denote the global solution
vector by $\mathbf{w}_h$ (in addition to the the corresponding finite
element function). Equivalently, the system above has the form
@f[
\frac{\partial \mathbf{w}_h}{\partial t} =
\mathcal M^{-1} \mathcal L_h(t, \mathbf{w}_h).
@f]

For hyperbolic systems discretized by high-order discontinuous Galerkin
methods, explicit time integration of this system is very popular. This is due
to the fact that the mass matrix $\mathcal M$ is block-diagonal (with each
block corresponding to only variables of the same kind defined on the same
cell) and thus easily inverted. In each time step -- or stage of a
Runge--Kutta scheme -- one only needs to evaluate the differential operator
once using the given data and subsequently apply the inverse of the mass
matrix. For implicit time stepping, on the other hand, one would first have to
linearize the equations and then iteratively solve the linear system, which
involves several residual evaluations and at least a dozen applications of
the linearized operator, as has been demonstrated in the step-33 tutorial
program.

Of course, the simplicity of explicit time stepping comes with a price, namely
conditional stability due to the so-called Courant--Friedrichs--Lewy (CFL)
condition. It states that the time step cannot be larger than the fastest
propagation of information by the discretized differential operator. In more
modern terms, the speed of propagation corresponds to the largest eigenvalue
in the discretized operator, and in turn depends on the mesh size, the
polynomial degree $p$ and the physics of the Euler operator, i.e., the
eigenvalues of the linearization of $\mathbf F(\mathbf w)$ with respect to
$\mathbf{w}$. In this program, we set the time step as follows:
@f[
\Delta t = \frac{\mathrm{Cr}}{p^{1.5}}\left(\frac{1}
           {\max\left[\frac{\|\mathbf{u}\|}{h_u} + \frac{c}{h_c}\right]}\right),
@f]

with the maximum taken over all quadrature points and all cells. The
dimensionless number $\mathrm{Cr}$ denotes the Courant number and can be
chosen up to a maximally stable number $\mathrm{Cr}_\text{max}$, whose value
depends on the selected time stepping method and its stability properties. The
power $p^{1.5}$ used for the polynomial scaling is heuristic and represents
the closest fit for polynomial degrees between 1 and 8, see e.g.
@cite SchoederKormann2018. In the limit of higher degrees, $p>10$, a scaling of
$p^2$ is more accurate, related to the inverse estimates typically used for
interior penalty methods. Regarding the <i>effective</i> mesh sizes $h_u$ and
$h_c$ used in the formula, we note that the convective transport is
directional. Thus an appropriate scaling is to use the element length in the
direction of the velocity $\mathbf u$. The code below derives this scaling
from the inverse of the Jacobian from the reference to real cell, i.e., we
approximate $\frac{\|\mathbf{u}\|}{h_u} \approx \|J^{-1} \mathbf
u\|_{\infty}$. The acoustic waves, instead, are isotropic in character, which
is why we use the smallest feature size, represented by the smallest singular
value of $J$, for the acoustic scaling $h_c$. Finally, we need to add the
convective and acoustic limits, as the Euler equations can transport
information with speed $\|\mathbf{u}\|+c$.

In this tutorial program, we use a specific variant of <a
href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">explicit
Runge--Kutta methods</a>, which in general use the following update procedure
from the state $\mathbf{w}_h^{n}$ at time $t^n$ to the new time $t^{n+1}$ with
$\Delta t = t^{n+1}-t^n$:
@f[
\begin{aligned}
\mathbf{k}_1 &= \mathcal M^{-1} \mathcal L_h\left(t^n, \mathbf{w}_h^n\right),
\\
\mathbf{k}_2 &= \mathcal M^{-1} \mathcal L_h\left(t^n+c_2\Delta t,
                       \mathbf{w}_h^n + a_{21} \Delta t \mathbf{k}_1\right),
\\
&\vdots \\
\mathbf{k}_s &= \mathcal M^{-1} \mathcal L_h\left(t^n+c_s\Delta t,
  \mathbf{w}_h^n + \sum_{j=1}^{s-1} a_{sj} \Delta t \mathbf{k}_j\right),
\\
\mathbf{w}_h^{n+1} &= \mathbf{w}_h^n + \Delta t\left(b_1 \mathbf{k}_1 +
b_2 \mathbf{k}_2 + \ldots + b_s \mathbf{k}_s\right).
\end{aligned}
@f]
The vectors $\mathbf{k}_i$, $i=1,\ldots,s$, in an $s$-stage scheme are
evaluations of the operator at some intermediate state and used to define the
end-of-step value $\mathbf{w}_h^{n+1}$ via some linear combination. The scalar
coefficients in this scheme, $c_i$, $a_{ij}$, and $b_j$, are defined such that
certain conditions are satisfied for higher order schemes, the most basic one
being $c_i = \sum_{j=1}^{i-1}a_{ij}$. The parameters are typically collected in
the form of a so-called <a
href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods#Explicit_Runge%E2%80%93Kutta_methods">Butcher
tableau</a> that collects all of the coefficients that define the
scheme. For a five-stage scheme, it would look like this:
@f[
\begin{array}{c|ccccc}
0 \\
c_2 & a_{21} \\
c_3 & a_{31} & a_{32} \\
c_4 & a_{41} & a_{42} & a_{43} \\
c_5 & a_{51} & a_{52} & a_{53} & a_{54} \\
\hline
& b_1 & b_2 & b_3 & b_4 & b_5
\end{array}
@f]

In this tutorial program, we use a subset of explicit Runge--Kutta methods,
so-called low-storage Runge--Kutta methods (LSRK), which assume additional
structure in the coefficients. In the variant used by reference
@cite KennedyCarpenterLewis2000, the assumption is to use Butcher tableaus of
the form
@f[
\begin{array}{c|ccccc}
0 \\
c_2 & a_1 \\
c_3 & b_1 & a_2 \\
c_4 & b_1 & b_2 & a_3 \\
c_5 & b_1 & b_2 & b_3 & a_4 \\
\hline
& b_1 & b_2 & b_3 & b_4 & b_5
\end{array}
@f]
With such a definition, the update to $\mathbf{w}_h^n$ shares the storage with
the information for the intermediate values $\mathbf{k}_i$. Starting with
$\mathbf{w}^{n+1}=\mathbf{w}^n$ and $\mathbf{r}_1 = \mathbf{w}^n$, the update
in each of the $s$ stages simplifies to
@f[
\begin{aligned}
\mathbf{k}_i &=
\mathcal M^{-1} \mathcal L_h\left(t^n+c_i\Delta t, \mathbf{r}_{i} \right),\\
\mathbf{r}_{i+1} &= \mathbf{w}_h^{n+1} + \Delta t \, a_i \mathbf{k}_i,\\
\mathbf{w}_h^{n+1} &= \mathbf{w}_h^{n+1} + \Delta t \, b_i \mathbf{k}_i.
\end{aligned}
@f]
Besides the vector $\mathbf w_h^{n+1}$ that is successively updated, this scheme
only needs two auxiliary vectors, namely the vector $\mathbf{k}_i$ to hold the
evaluation of the differential operator, and the vector $\mathbf{r}_i$ that
holds the right-hand side for the differential operator application. In
subsequent stages $i$, the values $\mathbf{k}_i$ and $\mathbf{r}_i$ can use
the same storage.

The main advantages of low-storage variants are the reduced memory consumption
on the one hand (if a very large number of unknowns must be fit in memory,
holding all $\mathbf{k}_i$ to compute subsequent updates can be a limit
already for $s$ in between five and eight -- recall that we are using
an explicit scheme, so we do not need to store any matrices that are
typically much larger than a few vectors), and the reduced memory access on
the other. In this program, we are particularly interested in the latter
aspect. Since cost of operator evaluation is only a small multiple of the cost
of simply streaming the input and output vector from memory with the optimized
matrix-free methods of deal.II, we must consider the cost of vector updates,
and low-storage variants can deliver up to twice the throughput of
conventional explicit Runge--Kutta methods for this reason, see e.g. the
analysis in @cite SchoederKormann2018.

Besides three variants for third, fourth and fifth order accuracy from the
reference @cite KennedyCarpenterLewis2000, we also use a fourth-order accurate
variant with seven stages that was optimized for acoustics setups from
@cite TseliosSimos2007. Acoustic problems are one of the interesting aspects of
the subsonic regime of the Euler equations where compressibility leads to the
transmission of sound waves; often, one uses further simplifications of the
linearized Euler equations around a background state or the acoustic wave
equation around a fixed frame.


<h3>Fast evaluation of integrals by matrix-free techniques</h3>

The major ingredients used in this program are the fast matrix-free techniques
we use to evaluate the operator $\mathcal L_h$ and the inverse mass matrix
$\mathcal M$. Actually, the term <i>matrix-free</i> is a slight misnomer,
because we are working with a nonlinear operator and do not linearize the
operator that in turn could be represented by a matrix. However, fast
evaluation of integrals has become popular as a replacement of sparse
matrix-vector products, as shown in step-37 and step-59, and we have coined
this infrastructure <i>matrix-free functionality</i> in deal.II for this
reason. Furthermore, the inverse mass matrix is indeed applied in a
matrix-free way, as detailed below.

The matrix-free infrastructure allows us to quickly evaluate the integrals in
the weak forms. The ingredients are the fast interpolation from solution
coefficients into values and derivatives at quadrature points, point-wise
operations at quadrature points (where we implement the differential operator
as derived above), as well as multiplication by all test functions and
summation over quadrature points. The first and third component make use of
sum factorization and have been extensively discussed in the step-37 tutorial
program for the cell integrals and step-59 for the face integrals. The only
difference is that we now deal with a system of $d+2$ components, rather than
the scalar systems in previous tutorial programs. In the code, all that
changes is a template argument of the FEEvaluation and FEFaceEvaluation
classes, the one to set the number of components. The access to the vector is
the same as before, all handled transparently by the evaluator. We also note
that the variant with a single evaluator chosen in the code below is not the
only choice -- we could also have used separate evalators for the separate
components $\rho$, $\rho \mathbf u$, and $E$; given that we treat all
components similarly (also reflected in the way we state the equation as a
vector system), this would be more complicated here. As before, the
FEEvaluation class provides explicit vectorization by combining the operations
on several cells (and faces), involving data types called
VectorizedArray. Since the arithmetic operations are overloaded for this type,
we do not have to bother with it all that much, except for the evaluation of
functions through the Function interface, where we need to provide particular
<i>vectorized</i> evaluations for several quadrature point locations at once.

A more substantial change in this program is the operation at quadrature
points: Here, the multi-component evaluators provide us with return types not
discussed before. Whereas FEEvaluation::get_value() would return a scalar
(more precisely, a VectorizedArray type due to vectorization across cells) for
the Laplacian of step-37, it now returns a type that is
`Tensor<1,dim+2,VectorizedArray<Number>>`. Likewise, the gradient type is now
`Tensor<1,dim+2,Tensor<1,dim,VectorizedArray<Number>>>`, where the outer
tensor collects the `dim+2` components of the Euler system, and the inner
tensor the partial derivatives in the various directions. For example, the
flux $\mathbf{F}(\mathbf{w})$ of the Euler system is of this type. In order to reduce the amount of
code we have to write for spelling out these types, we use the C++ `auto`
keyword where possible.

From an implementation point of view, the nonlinearity is not a big
difficulty: It is introduced naturally as we express the terms of the Euler
weak form, for example in the form of the momentum term $\rho \mathbf{u}
\otimes \mathbf{u}$. To obtain this expression, we first deduce the velocity
$\mathbf{u}$ from the momentum variable $\rho \mathbf{u}$. Given that $\rho
\mathbf{u}$ is represented as a $p$-degree polynomial, as is $\rho$, the
velocity $\mathbf{u}$ is a rational expression in terms of the reference
coordinates $\hat{\mathbf{x}}$. As we perform the multiplication $(\rho
\mathbf{u})\otimes \mathbf{u}$, we obtain an expression that is the
ratio of two polynomials, with polynomial degree $2p$ in the
numerator and polynomial degree $p$ in the denominator. Combined with the
gradient of the test function, the integrand is of degree $3p$ in the
numerator and $p$ in the denominator already for affine cells, i.e.,
for parallelograms/ parallelepipeds.
For curved cells, additional polynomial and rational expressions
appear when multiplying the integrand by the determinant of the Jacobian of
the mapping. At this point, one usually needs to give up on insisting on exact
integration, and take whatever accuracy the Gaussian (more precisely,
Gauss--Legrende) quadrature provides. The situation is then similar to the one
for the Laplace equation, where the integrand contains rational expressions on
non-affince cells and is also only integrated approximately. As these formulas
only integrate polynomials exactly, we have to live with the <a
href="https://mathoverflow.net/questions/26018/what-are-variational-crimes-and-who-coined-the-term">variational
crime</a> in the form of an integration error.

While inaccurate integration is usually tolerable for elliptic problems, for
hyperbolic problems inexact integration causes some headache in the form of an
effect called <b>aliasing</b>. The term comes from signal processing and
expresses the situation of inappropriate, too coarse sampling. In terms of
quadrature, the inappropriate sampling means that we use too few quadrature
points compared to what would be required to accurately sample the
variable-coefficient integrand. It has been shown in the DG literature that
aliasing errors can introduce unphysical oscillations in the numerical
solution for <i>barely</i> resolved simulations. The fact that aliasing mostly
affects coarse resolutions -- whereas finer meshes with the same scheme
work fine -- is not surprising because well-resolved simulations
tend to be smooth on length-scales of a cell (i.e., they have
small coefficients in the higher polynomial degrees that are missed by
too few quadrature points, whereas the main solution contribution in the lower
polynomial degrees is still well-captured -- this is simply a consequence of Taylor's
theorem). To address this topic, various approaches have been proposed in the
DG literature. One technique is filtering which damps the solution components
pertaining to higher polynomial degrees. As the chosen nodal basis is not
hierarchical, this would mean to transform from the nodal basis into a
hierarchical one (e.g., a modal one based on Legendre polynomials) where the
contributions within a cell are split by polynomial degrees. In that basis,
one could then multiply the solution coefficients associated with higher
degrees by a small number, keep the lower ones intact (to not destroy consistency), and
then transform back to the nodal basis. However, filters reduce the accuracy of the
method. Another, in some sense simpler, strategy is to use more quadrature
points to capture non-linear terms more accurately. Using more than $p+1$
quadrature points per coordinate directions is sometimes called
over-integration or consistent integration. The latter name is most common in
the context of the incompressible Navier-Stokes equations, where the
$\mathbf{u}\otimes \mathbf{u}$ nonlinearity results in polynomial integrands
of degree $3p$ (when also considering the test function), which can be
integrated exactly with $\textrm{floor}\left(\frac{3p}{2}\right)+1$ quadrature
points per direction as long as the element geometry is affine. In the context
of the Euler equations with non-polynomial integrands, the choice is less
clear. Depending on the variation in the various variables both
$\textrm{floor}\left(\frac{3p}{2}\right)+1$ or $2p+1$ points (integrating
exactly polynomials of degree $3p$ or $4p$, respectively) are common.

To reflect this variability in the choice of quadrature in the program, we
keep the number of quadrature points a variable to be specified just as the
polynomial degree, and note that one would make different choices depending
also on the flow configuration. The default choice is $p+2$ points -- a bit
more than the minimum possible of $p+1$ points. The FEEvaluation and
FEFaceEvaluation classes allow to seamlessly change the number of points by a
template parameter, such that the program does not get more complicated
because of that.


<h3>Evaluation of the inverse mass matrix with matrix-free techniques</h3>

The last ingredient is the evaluation of the inverse mass matrix $\mathcal
M^{-1}$. In DG methods with explicit time integration, mass matrices are
block-diagonal and thus easily inverted -- one only needs to invert the
diagonal blocks. However, given the fact that matrix-free evaluation of
integrals is closer in cost to the access of the vectors only, even the
application of a block-diagonal matrix (e.g. via an array of LU factors) would
be several times more expensive than evaluation of $\mathcal L_h$
simply because just storing and loading matrices of size
`dofs_per_cell` times `dofs_per_cell` for higher order finite elements
repeatedly is expensive. As this is
clearly undesirable, part of the community has moved to bases where the mass
matrix is diagonal, for example the <i>L<sub>2</sub></i>-orthogonal Legendre basis using
hierarchical polynomials or Lagrange polynomials on the points of the Gaussian
quadrature (which is just another way of utilizing Legendre
information). While the diagonal property breaks down for deformed elements,
the error made by taking a diagonal mass matrix and ignoring the rest (a
variant of mass lumping, though not the one with an additional integration
error as utilized in step-48) has been shown to not alter discretization
accuracy. The Lagrange basis in the points of Gaussian quadrature is sometimes
also referred to as a collocation setup, as the nodal points of the
polynomials coincide (= are "co-located") with the points of quadrature, obviating some
interpolation operations. Given the fact that we want to use more quadrature
points for nonlinear terms in $\mathcal L_h$, however, the collocation
property is lost. (More precisely, it is still used in FEEvaluation and
FEFaceEvaluation after a change of basis, see the matrix-free paper
@cite KronbichlerKormann2019.)

In this tutorial program, we use the collocation idea for the application of
the inverse mass matrix, but with a slight twist. Rather than using the
collocation via Lagrange polynomials in the points of Gaussian quadrature, we
prefer a conventional Lagrange basis in Gauss-Lobatto points as those make the
evaluation of face integrals cheap. This is because for Gauss-Lobatto
points, some of the node points are located on the faces of the cell
and it is not difficult to show that on any given face, the only shape
functions with non-zero values are exactly the ones whose node points
are in fact located on that face. One could of course also use the
Gauss-Lobatto quadrature (with some additional integration error) as was done
in step-48, but we do not want to sacrifice accuracy as these
quadrature formulas are generally of lower order than the general
Gauss quadrature formulas. Instead, we use an idea described in the reference
@cite KronbichlerSchoeder2016 where it was proposed to change the basis for the
sake of applying the inverse mass matrix. Let us denote by $S$ the matrix of
shape functions evaluated at quadrature points, with shape functions in the row
of the matrix and quadrature points in columns. Then, the mass matrix on a cell
$K$ is given by
@f[
\mathcal M^K = S J^K S^\mathrm T.
@f]
Here, $J^K$ is the diagonal matrix with the determinant of the Jacobian times
the quadrature weight (JxW) as entries. The matrix $S$ is constructed as the
Kronecker product (tensor product) of one-dimensional matrices, e.g. in 3D as
@f[
S = S_{\text{1D}}\otimes S_{\text{1D}}\otimes S_{\text{1D}},
@f]
which is the result of the basis functions being a tensor product of
one-dimensional shape functions and the quadrature formula being the tensor
product of 1D quadrature formulas. For the case that the number of polynomials
equals the number of quadrature points, all matrices in $S J^K S^\mathrm T$
are square, and also the ingredients to $S$ in the Kronecker product are
square. Thus, one can invert each matrix to form the overall inverse,
@f[
\left(\mathcal M^K\right)^{-1} = S_{\text{1D}}^{-\mathrm T}\otimes
S_{\text{1D}}^{-\mathrm T}\otimes S_{\text{1D}}^{-\mathrm T}
\left(J^K\right)^{-1}
S_{\text{1D}}^{-1}\otimes S_{\text{1D}}^{-1}\otimes S_{\text{1D}}^{-1}.
@f]
This formula is of exactly the same structure as the steps in the forward
evaluation of integrals with sum factorization techniques (i.e., the
FEEvaluation and MatrixFree framework of deal.II). Hence, we can utilize the
same code paths with a different interpolation matrix,
$S_{\mathrm{1D}}^{-\mathrm{T}}$ rather than $S_{\mathrm{1D}}$.

The class MatrixFreeOperators::CellwiseInverseMassMatrix implements this
operation: It changes from the basis contained in the finite element (in this
case, FE_DGQ) to the Lagrange basis in Gaussian quadrature points. Here, the
inverse of a diagonal mass matrix can be evaluated, which is simply the inverse
of the `JxW` factors (i.e., the quadrature weight times the determinant of the
Jacobian from reference to real coordinates). Once this is done, we can change
back to the standard nodal Gauss-Lobatto basis.

The advantage of this particular way of applying the inverse mass matrix is
a cost similar to the forward application of a mass matrix, which is cheaper
than the evaluation of the spatial operator $\mathcal L_h$
with over-integration and face integrals. (We
will demonstrate this with detailed timing information in the
<a href="#Results">results section</a>.) In fact, it
is so cheap that it is limited by the bandwidth of reading the source vector,
reading the diagonal, and writing into the destination vector on most modern
architectures. The hardware used for the result section allows to do the
computations at least twice as fast as the streaming of the vectors from
memory.


<h3>The test case</h3>

In this tutorial program, we implement two test cases. The first case is a
convergence test limited to two space dimensions. It runs a so-called
isentropic vortex which is transported via a background flow field. The second
case uses a more exciting setup: We start with a cylinder immersed in a
channel, using the GridGenerator::channel_with_cylinder() function. Here, we
impose a subsonic initial field at Mach number of $\mathrm{Ma}=0.307$ with a
constant velocity in $x$ direction. At the top and bottom walls as well as at
the cylinder, we impose a no-penetration (i.e., tangential flow)
condition. This setup forces the flow to re-orient as compared to the initial
condition, which results in a big sound wave propagating away from the
cylinder. In upstream direction, the wave travels more slowly (as it
has to move against the oncoming gas), including a
discontinuity in density and pressure. In downstream direction, the transport
is faster as sound propagation and fluid flow go in the same direction, which smears
out the discontinuity somewhat. Once the sound wave hits the upper and lower
walls, the sound is reflected back, creating some nice shapes as illustrated
in the <a href="#Results">results section</a> below.