## Overview

The following text has been taken from Section 2 of Heimbach and Bugnion (2009) [1].

Conventionally, sensitivities are assessed by perturbing a control variable of interest and investigating the ice sheet's response to the applied perturbation. For each quantity a separate run needs to be performed, and for quantities that vary spatially, assumptions need to be made as to where to perturb. At one extreme, perturbations are spatially uniform (e.g. a uniform air temperature perturbation everywhere); at the other extreme, each grid point is perturbed separately, and a forward simulation is performed for each such perturbation in order to produce a full sensitivity map. For a model configuration with $n_x \times n_y \times n_z \, = \, 82 \times 140 \times 80$ grid points, the initial value control problem alone spans a 918,400-dimensional control space.
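The cost of this brute-force approach can be sketched in a few lines; the `model` and `cost` functions below are hypothetical stand-ins for an ice-sheet model and its objective, not anything from the paper:

```python
# Toy sketch of brute-force sensitivity mapping: one extra forward
# run per component of the control vector u (hypothetical model/cost).

def model(u):
    # stand-in forward model: damp each state component
    return [0.9 * ui for ui in u]

def cost(u):
    # stand-in scalar objective: "volume" after one model step
    return sum(model(u))

def sensitivity_map(u, eps=1e-6):
    """One-sided finite differences: len(u) + 1 forward runs in total."""
    j0 = cost(u)
    return [(cost(u[:i] + [u[i] + eps] + u[i+1:]) - j0) / eps
            for i in range(len(u))]

print(sensitivity_map([1.0, 2.0, 3.0]))  # each entry close to 0.9
```

For the 918,400-dimensional control space quoted above, this strategy would require one forward integration per control variable, which is precisely the expense the adjoint avoids.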

Alternatively, the adjoint model is able to provide a complete set, or map, of sensitivities in a single integration. For all its appeal, obtaining an adjoint model is not an easy task. Encouraged by comments on a first version of this text, we attempt, in the following, to provide a short self-contained introduction to adjoint modeling and automatic differentiation, as no such description exists yet in the glaciological literature. We establish the connection between the tangent linear model (TLM), the adjoint model (ADM) and Lagrange multipliers. We then show how to use automatic differentiation (AD) to generate tangent linear and adjoint model code.

### The adjoint or Lagrange multiplier method

Our goal is to find sensitivities, i.e. partial derivatives of a scalar-valued objective or cost function $J_0[\mathbf{x}]$ with respect to control variables $\mathbf{u}$. The dependency of $J_0$ on $\mathbf{u}$ is usually indirect, i.e. it enters through the dependency of the model state variables $\mathbf{x}$ on $\mathbf{u}$. For simplicity we focus on the case where $\mathbf{u}$ is the model's initial state $\mathbf{x}(t_0=0)$. Following the notation of [2] the time-dependent model $L$ has the general form

\begin{equation} \mathbf{x}(t) \, - \, L \left[ \mathbf{x}(t-1) \right] \, = \, 0 \label{eqn:model} \end{equation}

and is integrated from time $t_0=0$ to $t=t_f$. (Note that to simplify notation, we can formally extend the model state space $\mathbf{x}$ to include both model prognostic variables as well as model parameters and boundary conditions.) To take an example, let the objective function consist of the time-mean volume over the last $n+1$ time steps $t_f-n, \, \ldots , \, t_f-1, \, t_f$, of the ice sheet, expressed as the spatial, area-weighted sum $V[\mathbf{x}(t)] = \sum_{i,j} H(i,j,t) {\rm d} A(i,j)$ over the thickness field $\mathbf{H}(t)$ at time $t$ which is an element of the model prognostic state space $\mathbf{x}(t)$. Thus,

\begin{equation} J_0[\mathbf{x}] \, = \, \frac{1}{n+1} \Big( \, V[\mathbf{x}(t_f-n)] + \ldots + V[\mathbf{x}(t_f)] \, \Big) \label{eqn:costvol} \end{equation}
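As a minimal numeric illustration (the grid values below are made up, not from the paper), the area-weighted volume sum and its time mean read:

```python
# Minimal sketch of V[x(t)] = sum_{i,j} H(i,j,t) dA(i,j) and of J_0
# as the time-mean volume; the 2x2 grid values are hypothetical.

def volume(H, dA):
    """Area-weighted ice volume for one thickness field H."""
    return sum(H[i][j] * dA[i][j]
               for i in range(len(H)) for j in range(len(H[0])))

def cost_J0(H_steps, dA):
    """Time-mean volume over the n+1 stored thickness fields."""
    return sum(volume(H, dA) for H in H_steps) / len(H_steps)

dA = [[1.0, 1.0], [1.0, 1.0]]            # uniform cell areas
H_steps = [[[2.0, 2.0], [2.0, 2.0]],     # H at t_f - 1
           [[3.0, 3.0], [3.0, 3.0]]]     # H at t_f
print(cost_J0(H_steps, dA))  # (8.0 + 12.0) / 2 = 10.0
```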

The Lagrange multiplier method consists of rewriting the problem of finding derivatives of $J_0$ subject to the constraint of fulfilling eqn. (\ref{eqn:model}) as an extended, unconstrained problem:

\begin{equation} J \, = \, J_0[\mathbf{x}] \, - \, \sum_{t=1}^{t_f} \mu^T(t) \big\{ \, \mathbf{x}(t) \, - \, L \left[ \mathbf{x}(t-1) \right] \, \big\} \label{eqn:extendedcost} \end{equation}

For each element of the model state $\mathbf{x}(t)$ at time $t$ we have introduced a corresponding Lagrange multiplier $\mu(t)$.

The set of normal equations is obtained by requiring the partial derivatives of (\ref{eqn:extendedcost}) with respect to each variable at times $t>t_0$ to vanish independently (see e.g. [2]):

\begin{align} \frac{\partial J}{\partial \mu(t)} \, & = \mathbf{x}(t) - L \left[ \mathbf{x}(t-1) \right] = 0 & 1 \le t \le t_f \label{eqn:normaleq1} \\ \frac{\partial J}{\partial \mathbf{x}(t)} \, & = \frac{\partial J_0}{\partial \mathbf{x}(t)} - \mu(t) \nonumber \\ & \, \quad + \left[ \frac{\partial L [ \mathbf{x}(t) ] }{\partial \mathbf{x}(t)} \right]^T \mu(t+1) = 0 & 0 < t < t_f \label{eqn:normaleq2} \\ \frac{\partial J}{\partial \mathbf{x}(t_f)} \, & = \frac{\partial J_0}{\partial \mathbf{x}(t_f)} - \mu(t_f) = 0 & t=t_f \label{eqn:normaleq3} \\ \frac{\partial J}{\partial \mathbf{x}(0)} \, & = \frac{\partial J_0}{\partial \mathbf{x}(0)} + \left[ \frac{\partial L [ \mathbf{x}(0) ] }{\partial \mathbf{x}(0)} \right]^T \mu(1) & t_0=0 \label{eqn:normaleq4} \end{align}

Eqn. (\ref{eqn:normaleq1}) simply recovers the model equations. The Lagrange multipliers $\mu(t)$ are found through successive evaluation of the normal equations backward in time. Starting at $t=t_f$ we find, via eqn. (\ref{eqn:normaleq3}), and using the example cost (\ref{eqn:costvol}):

$\mu(t_f) = \frac{\partial J_0}{\partial \mathbf{x}(t_f)} \, = \, \frac{1}{n+1}\frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)}$

$n$ time steps earlier, at $t=t_f-n$, and using the results for $\mu(t_f), \, \ldots, \, \mu(t_f-n+1)$, we obtain, via eqn. (\ref{eqn:normaleq2}):

\begin{align} \mu(t_f-n) \, = & \, \frac{1}{n+1} \left\{ \frac{\partial V[\mathbf{x}(t_f-n)]}{\partial \mathbf{x}(t_f-n)} \right. \nonumber \\ & \,\, + \left[ \frac{\partial L[\mathbf{x}(t_f-n)]}{\partial \mathbf{x}(t_f-n)} \right]^T \cdot \frac{\partial V[\mathbf{x}(t_f-n+1)]}{\partial \mathbf{x}(t_f-n+1)} \nonumber \\ & \,\, + \ldots \nonumber \\ & \,\, + \left[ \frac{\partial L[\mathbf{x}(t_f-n)]}{\partial \mathbf{x}(t_f-n)} \right]^T \cdot \ldots \cdot \left[ \frac{\partial L[\mathbf{x}(t_f-1)]}{\partial \mathbf{x}(t_f-1)} \right]^T \nonumber \\ & \qquad \cdot \left. \frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)} \right\} \label{eqn:lagrangetime} \end{align}

Finally, eqn. (\ref{eqn:normaleq4}) provides the expression for the full gradient sought at time $t_0=0$.

The interpretation is as follows: the Lagrange multiplier $\mu(t)$ provides the complete sensitivity of $J_0$ at time $t$ by accumulating all partial derivatives of $J_0$ with respect to $\mathbf{x}$ from each time step $t_f, \, t_f-1, \, \ldots , \, t$. Those partials taken at later times $t+1, \, \ldots , \, t_f$ are propagated to time $t$ via the adjoint model (ADM), which is the transpose $\left[ \frac{\partial L[\mathbf{x}(t)]}{\partial \mathbf{x}(t)} \right]^T$ of the model Jacobian or tangent linear model (TLM), $\frac{\partial \mathbf{x}(t+1)}{\partial \mathbf{x}(t)} = \frac{\partial L[\mathbf{x}(t)]}{\partial \mathbf{x}(t)}$, and contributions from different times are linearly superimposed. Further simplifying the example objective function (\ref{eqn:costvol}) to the special case where, instead of the time mean, only the volume at the last time step $t_f$ is chosen, i.e. $n=0$, eqn. (\ref{eqn:lagrangetime}) simplifies in that all terms except the one containing $\frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)}$ vanish.
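The backward sweep can be sketched for a toy linear model $\mathbf{x}(t) = A \, \mathbf{x}(t-1)$ with $n=0$, so that only $\frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)}$ seeds the recursion; the matrix $A$, the value of $t_f$, and the cost below are hypothetical choices, not from the paper:

```python
# Sketch of the backward Lagrange-multiplier recursion
# mu(t-1) = A^T mu(t) for a toy linear model x(t) = A x(t-1),
# with cost J0 = sum of the final-state components (so dJ0/dx(tf) = ones).

def matvec_T(A, v):
    """Apply the transpose (the adjoint) of A to v."""
    n = len(A)
    return [sum(A[i][j] * v[i] for i in range(n)) for j in range(n)]

A = [[0.5, 0.1],
     [0.0, 0.8]]        # constant model Jacobian (L is linear here)
tf = 3

mu = [1.0, 1.0]         # mu(tf) = dJ0/dx(tf)
for t in range(tf, 0, -1):
    mu = matvec_T(A, mu)   # one adjoint step backward in time

print(mu)  # full gradient dJ0/dx(0) from a single backward pass
```

Each component of `mu` matches what perturbing the corresponding component of $\mathbf{x}(0)$ would yield by finite differences, but here all components are obtained in one backward integration.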

### The tangent linear and adjoint models

All relevant aspects of the Lagrange multiplier method (LMM) have now been derived, but we wish to make plain the duality between the tangent linear model (TLM) and the adjoint model (ADM), and re-state our problem in a slightly different, but equivalent, framework which provides a natural basis for introducing the concept of automatic differentiation. The nonlinear model (NLM) $L$ of eqn. (\ref{eqn:model}) may be viewed as a mapping of the state space (we again incorporate all parameters, initial and boundary conditions into an extended state space) from time $t=0$ to $t=t_f$. Then, the cost function $J_0$, eqn. (\ref{eqn:costvol}), is a composite mapping from the state space at $t_0=0$ to $t=t_f$, and from there to the real numbers. To simplify notation, let $\mathbf{y} = \mathbf{x}(t_f)$, and consider the special case $n=0$, i.e. $J_0 = V[\mathbf{x}(t_f)] = V[\mathbf{y}]$. Then,

\begin{equation} \begin{array}{cccccc} J_0 \, : & \mathbf{x}(0) & \longmapsto & \mathbf{x}(t_f) & \longmapsto & J_0[\mathbf{x}] \\ ~ & \mathbf{x}(0) & \longmapsto & L[\mathbf{x}(t_f-1)] & \longmapsto & V[L[\mathbf{x}(t_f-1)]] \\ \end{array} \label{eqn:mapping} \end{equation}

The composite nature of the mapping of $J_0$ is readily apparent:

\begin{align} J_0 & = \, V[\mathbf{x}(t_f)] \nonumber \\ ~ & = \, V[L[L[ \ldots L[ \mathbf{x}(0) ]]]] \label{eqn:composition} \end{align}

A perturbation of the initial state $\mathbf{x}(0)$ is linked to a perturbation of $J_0$ by applying the chain rule to (\ref{eqn:composition}):

\begin{align} \delta J_0 & = \, \frac{\partial V}{\partial \mathbf{x}(t_f)} \delta \mathbf{x}(t_f) \nonumber \\ ~ & = \, \frac{\partial V}{\partial \mathbf{x}(t_f)} \cdot \frac{\partial \mathbf{x}(t_f)}{\partial \mathbf{x}(t_f-1)} \cdot \ldots \cdot \frac{\partial \mathbf{x}(1)}{\partial \mathbf{x}(0)} \cdot \delta \mathbf{x}(0) \end{align}

Recognizing that $\delta J_0$ is just the scalar product of $\frac{\partial V}{\partial \mathbf{y}}$ and $\delta \mathbf{y}$, we can take advantage of the formal definition of an adjoint operator $< x \, | \, Ay> = < A^T x \, | \, y >$ to rewrite this equation:

\begin{align} \delta J_0 & = \, < \frac{\partial V}{\partial \mathbf{y}} \,\, | \,\, \delta \mathbf{y} > \nonumber \\ ~ & = \, < \frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\, \frac{\partial \mathbf{x}(t_f)}{\partial \mathbf{x}(t_f-1)} \cdot \ldots \cdot \frac{\partial \mathbf{x}(1)}{\partial \mathbf{x}(0)} \cdot \delta \mathbf{x}(0) > \nonumber \\ ~ & = \, < \left[ \frac{\partial \mathbf{x}(1)}{\partial \mathbf{x}(0)} \right]^T \cdot \ldots \cdot \left[ \frac{\partial \mathbf{x}(t_f)}{\partial \mathbf{x}(t_f-1)} \right]^T \cdot \frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\, \delta \mathbf{x}(0) > \nonumber \\ ~ & = \, < \frac{\partial V}{\partial \mathbf{x}(0)} \,\, | \,\, \delta \mathbf{x}(0) > \label{eqn:scalarprod} \end{align}

or, in short, combining the tangent linear and adjoint matrices into $\mathcal{TLM}$ and $\mathcal{ADM}$, respectively:

\begin{align} \delta J_0 & = \, < \frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\, \mathcal{TLM} \cdot \delta \mathbf{x}(0) > \nonumber \\ ~ & = \, < \mathcal{ADM} \cdot \frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\, \delta \mathbf{x}(0) > \label{eqn:shortscalar} \end{align}

Eqns. (\ref{eqn:lagrangetime}), (\ref{eqn:scalarprod}), and (\ref{eqn:shortscalar}) expose several features:

- the equivalence between Lagrange multipliers and the adjoint operator;
- the TLM runs forward in time, propagating the effect of a perturbation $\delta \mathbf{x}(0)$ to all model outputs, while the ADM runs backward in time, accumulating sensitivities of $J_0$ to all model inputs;
- the advantage of the ADM, which provides the full gradient $\frac{\partial V}{\partial \mathbf{x}(0)}$ of a model-constrained objective function, over the TLM, which only provides the projection of $\frac{\partial V}{\partial \mathbf{y}}$ onto the perturbation $\delta \mathbf{y}$ resulting from the initial perturbation $\delta \mathbf{x}(0)$.
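These features can be checked numerically for a single toy Jacobian (the $2 \times 2$ matrix and the vectors below are hypothetical illustrations): both sides of the duality relation give the same $\delta J_0$, but only the adjoint side materializes the full gradient:

```python
# Numerical check of < g | TLM dx > = < ADM g | dx > for a toy
# 2x2 Jacobian; matrix and vectors are hypothetical illustrations.

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def matvec_T(A, v):
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

TLM = [[2.0, 1.0],
       [0.5, 3.0]]       # stand-in for the model Jacobian
g  = [1.0, -1.0]         # dV/dx(tf)
dx = [0.3, 0.7]          # initial perturbation dx(0)

tlm_side = dot(g, matvec(TLM, dx))    # forward: one directional derivative
adm_side = dot(matvec_T(TLM, g), dx)  # reverse: full gradient, then project
print(tlm_side, adm_side)  # equal up to rounding
```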

## Derivative code generation via automatic differentiation

The mapping $J_0$ is a composition of a large number of elementary operations; eqn. (\ref{eqn:mapping}) reflects this at the time-stepping level. Carrying the concept of composition to its extreme, each line of code can ultimately be viewed as an elementary operation. At this elementary level, the AD tool knows the complete set of derivatives (i.e. the elementary Jacobians) of each intrinsic arithmetic function (+, -, *, /, $\sqrt{.}$, $\sin(.)$, etc.), as well as of logical/conditional instructions.
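The chain-rule bookkeeping at the level of elementary operations can be illustrated with a minimal forward-mode (tangent-linear) evaluation using dual numbers. This is a conceptual sketch in Python, whereas the AD tools discussed here operate source-to-source on model code:

```python
import math

# Each elementary operation (+, *, sin, ...) carries its own local
# derivative rule; the chain rule composes them automatically.

class Dual:
    """A value together with its tangent (directional derivative)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # product rule: the elementary Jacobian of '*'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    # elementary derivative rule for sin
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

x = Dual(2.0, 1.0)   # seed the tangent direction dx = 1
y = x * sin(x)       # f(x) = x sin(x)
print(y.val, y.dot)  # f(2) and f'(2) = sin(2) + 2 cos(2)
```

Reverse-mode AD, which yields the adjoint, applies the same elementary rules but accumulates them in the opposite order, as in the backward sweep derived above.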