Lagrange multipliers and adjoints

From Interactive System for Ice sheet Simulation

The following text has been taken from Section 2 of Heimbach and Bugnion (2009) [1]

Conventionally, sensitivities are assessed by perturbing a control variable of interest and investigating the ice sheet's response to the applied perturbation. A separate run is needed for each quantity, and for quantities that vary spatially, assumptions must be made as to where to perturb. At one extreme, the perturbation is spatially uniform (e.g. the same air temperature perturbation everywhere); at the other, each grid point is perturbed separately, with one forward simulation per perturbation, in order to produce a full sensitivity map. For a model configuration with n_x \times n_y \times n_z \, = \, 82 \times 140 \times 80 grid points, the initial value control problem alone spans a 918,400-dimensional control space.
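As a deliberately tiny illustration of this brute-force approach, the following Python sketch builds a sensitivity map by finite differences, at the cost of one perturbed forward run per control variable. The "model" here is an invented scalar surrogate, not an ice sheet code; all names are illustrative:

```python
# Sensitivity map by brute-force perturbation: one forward run per control.
# forward_model is an invented stand-in for a full ice-sheet integration.

def forward_model(u):
    # scalar diagnostic of the state reached from controls u
    return sum(ui**2 for ui in u) / len(u)

def fd_sensitivity_map(u, eps=1e-6):
    """One perturbed forward run per control variable: O(n) integrations."""
    J0 = forward_model(u)
    grad = []
    for i in range(len(u)):
        up = list(u)
        up[i] += eps                       # perturb one control at a time
        grad.append((forward_model(up) - J0) / eps)
    return grad

print(fd_sensitivity_map([1.0, 2.0, 3.0]))  # ≈ [2/3, 4/3, 2], i.e. 2*u_i/n
```

For the 918,400-dimensional control space above, this loop would require 918,400 forward integrations, which is exactly what the adjoint method avoids.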

Alternatively, the adjoint model is able to provide a complete set, or map, of sensitivities in one single integration. For all its appeal, obtaining an adjoint model is not an easy task. Encouraged by comments on a first version of this text, we attempt, in the following, to provide a short, self-contained introduction to adjoint modeling and automatic differentiation, since no such description exists yet in the glaciological literature. We establish the connection between the tangent linear model (TLM), the adjoint model (ADM) and Lagrange multipliers. We then show how to use automatic differentiation (AD) to generate tangent linear and adjoint model code.

The adjoint or Lagrange multiplier method

Our goal is to find sensitivities, i.e. partial derivatives of a scalar-valued objective or cost function J_0[\mathbf{x}] with respect to control variables \mathbf{u}. The dependency of J_0 on \mathbf{u} is usually indirect, i.e. it enters through the dependence of the model's state variables \mathbf{x} on \mathbf{u}. For simplicity we focus on the case where \mathbf{u} is the model's initial state \mathbf{x}(t_0=0). Following the notation of [2], the time-dependent model L has the general form

\mathbf{x}(t) \, - \, L \left[ \mathbf{x}(t-1) \right] \, = \, 0

and is integrated from time t_0=0 to t=t_f. (Note that to simplify notation, we can formally extend the model state space \mathbf{x} to include both model prognostic variables as well as model parameters and boundary conditions.) To take an example, let the objective function consist of the time-mean volume over the last n+1 time steps t_f-n, \, \ldots , \, t_f-1, \, t_f, of the ice sheet, expressed as the spatial, area-weighted sum V[\mathbf{x}(t)] = \sum_{i,j} H(i,j,t) {\rm d} A(i,j) over the thickness field \mathbf{H}(t) at time t which is an element of the model prognostic state space \mathbf{x}(t). Thus,

J_0[\mathbf{x}] \, = \, 
\frac{1}{n+1} \Big( \,
V[\mathbf{x}(t_f-n)] + \ldots + V[\mathbf{x}(t_f)] \, \Big)
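For concreteness, the example objective can be sketched for a scalar surrogate model. The step function and the numbers below are invented stand-ins for the ice sheet model L and its volume diagnostic V:

```python
# Toy illustration of the example cost function: the time-mean "volume" over
# the last n+1 steps of a scalar surrogate model. All values are invented.

def step(x):
    # stand-in for the model L: one time step of a (here trivial) evolution
    return 0.9 * x + 1.0

def J0(x0, tf, n):
    """Time-mean of V[x(t)] over t = tf-n .. tf, with V the identity here."""
    x, traj = x0, []
    for t in range(1, tf + 1):
        x = step(x)
        traj.append(x)          # traj[t-1] holds x(t)
    window = traj[tf - n - 1: tf]   # states at t = tf-n, ..., tf
    return sum(window) / (n + 1)

print(J0(0.0, 4, 1))   # mean of x(3) and x(4)
```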

The Lagrange multiplier method consists of rewriting the problem of finding derivatives of J_0 subject to the constraint of fulfilling eqn. (\ref{eqn:model}) as an extended, unconstrained problem:

J \, = \, J_0[\mathbf{x}] \, - \, 
\sum_{t=1}^{t_f} \mu^T(t) 
\big\{ \, \mathbf{x}(t) \, - \, L \left[ \mathbf{x}(t-1) \right] \, \big\}

For each element of the model state \mathbf{x}(t) at time t we have introduced a corresponding Lagrange multiplier \mu(t).

The set of normal equations is obtained by requiring the partial derivatives of (\ref{eqn:extendedcost}) with respect to each variable to vanish independently (see e.g. [2]):

\frac{\partial J}{\partial \mu(t)} \, & = 
\mathbf{x}(t) - L \left[ \mathbf{x}(t-1) \right] = 0 &  1 \le t \le t_f \\
\frac{\partial J}{\partial \mathbf{x}(t)} \, & = 
\frac{\partial J_0}{\partial \mathbf{x}(t)} - \mu(t) & \\
 & \, \quad +
\left[ \frac{\partial L [ \mathbf{x}(t) ] }{\partial \mathbf{x}(t)} \right]^T
\mu(t+1) = 0 &  0 < t < t_f \\
\frac{\partial J}{\partial \mathbf{x}(t_f)} \, & = 
\frac{\partial J_0}{\partial \mathbf{x}(t_f)} - \mu(t_f) = 0 & t=t_f \\
\frac{\partial J}{\partial \mathbf{x}(0)} \, & = 
\frac{\partial J_0}{\partial \mathbf{x}(0)} + 
\left[ \frac{\partial L [ \mathbf{x}(0) ] }{\partial \mathbf{x}(0)} \right]^T
\mu(1) & t_0=0

Eqn. (\ref{eqn:normaleq1}) simply recovers the model equations. The Lagrange multipliers \mu(t) are found through successive evaluation of the normal equations backward in time. Starting at t=t_f we find, via eqn. (\ref{eqn:normaleq3}) and using the example cost function (\ref{eqn:costvol}):

\mu(t_f) = \frac{\partial J_0}{\partial \mathbf{x}(t_f)}
\, = \,
\frac{1}{n+1}\frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)}

n time steps earlier, at t=t_f-n, and using the results for \mu(t_f), \, \ldots, \, \mu(t_f-n+1), we obtain, via eqn. (\ref{eqn:normaleq2}):

\mu(t_f-n) \, = & \, \frac{1}{n+1} \left\{
\frac{\partial V[\mathbf{x}(t_f-n)]}{\partial \mathbf{x}(t_f-n)} \right. \\
~ & \,\, +
\left[ \frac{\partial L[\mathbf{x}(t_f-n)]}{\partial \mathbf{x}(t_f-n)} \right]^T \cdot
\frac{\partial V[\mathbf{x}(t_f-n+1)]}{\partial \mathbf{x}(t_f-n+1)} \\
~ & \,\, + \ldots \\
~ & \,\, + 
\left[ \frac{\partial L[\mathbf{x}(t_f-n)]}{\partial \mathbf{x}(t_f-n)} \right]^T
\cdot \ldots \cdot
\left[ \frac{\partial L[\mathbf{x}(t_f-1)]}{\partial \mathbf{x}(t_f-1)} \right]^T \\
~ & \qquad \cdot \left.
\frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)} \right\} \\

Finally, eqn. (\ref{eqn:normaleq4}) provides the expression for the full gradient sought at time t_0=0.

The interpretation is as follows: the Lagrange multiplier \mu(t) provides the complete sensitivity of J_0 at time t by accumulating all partial derivatives of J_0 with respect to \mathbf{x} from each time step t_f, \, t_f-1, \, \ldots , \, t. Partials taken at later times t+1, \, \ldots , \, t_f are propagated to time t via the adjoint model (ADM), which is the transpose \left[ \frac{\partial L[\mathbf{x}(t)]}{\partial \mathbf{x}(t)} \right]^T of the model Jacobian or tangent linear model (TLM), \frac{\partial \mathbf{x}(t+1)}{\partial \mathbf{x}(t)} = \frac{\partial L[\mathbf{x}(t)]}{\partial \mathbf{x}(t)}; contributions from different times are linearly superimposed. Further simplifying the example objective function (\ref{eqn:costvol}) to the special case where, instead of the time-mean, only the volume at the last time step t_f is chosen, i.e. n=0, eqn. (\ref{eqn:lagrangetime}) simplifies in that all terms except the one containing \frac{\partial V[\mathbf{x}(t_f)]}{\partial \mathbf{x}(t_f)} vanish.
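The backward accumulation of the Lagrange multipliers can be made concrete for a scalar toy model, where the Jacobian transpose reduces to an ordinary multiplication. The model L, its hand-coded derivative, and the objective V below are illustrative stand-ins (the n=0 case, J_0 = V[x(t_f)]):

```python
# Backward-in-time accumulation of the Lagrange multipliers mu(t) for a
# scalar toy model x(t) = L[x(t-1)], verified against a finite difference.

import math

def L(x):    return math.sin(x) + 0.5 * x   # toy nonlinear step
def dLdx(x): return math.cos(x) + 0.5       # its tangent linear factor
def V(x):    return x * x                   # objective at the final time
def dVdx(x): return 2.0 * x

def adjoint_gradient(x0, tf):
    # forward sweep: store the trajectory (needed to evaluate the Jacobians)
    xs = [x0]
    for _ in range(tf):
        xs.append(L(xs[-1]))
    # backward sweep: mu(tf) = dV/dx(tf); then mu(t) = dL/dx|_{x(t)} * mu(t+1)
    mu = dVdx(xs[tf])
    for t in range(tf - 1, -1, -1):
        mu = dLdx(xs[t]) * mu
    return mu        # = dJ0/dx(0), the full gradient at the initial time

x0, tf, eps = 0.3, 5, 1e-7

def J0(x):
    for _ in range(tf):
        x = L(x)
    return V(x)

fd = (J0(x0 + eps) - J0(x0)) / eps
print(adjoint_gradient(x0, tf), fd)   # the two should agree closely
```

In the vector-valued case, each `dLdx(xs[t]) * mu` becomes a transposed Jacobian-vector product, which is why the trajectory `xs` must be stored (or recomputed) for the backward sweep.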

The tangent linear and adjoint models

All relevant aspects of the Lagrange multiplier method have now been derived, but we wish to make plain the duality between the tangent linear model (TLM) and the adjoint model (ADM), and to restate our problem in a slightly different but equivalent framework which provides a natural basis for introducing the concept of automatic differentiation. The nonlinear model (NLM) L of eqn. (\ref{eqn:model}) may be viewed as a mapping of the state space (we again incorporate all parameters, initial and boundary conditions into an extended state space) from time t=0 to t=t_f. The cost function J_0, eqn. (\ref{eqn:costvol}), is then a composite mapping from the state space at t_0=0 to that at t=t_f, and from there to the real numbers. To simplify notation, let \mathbf{y} = \mathbf{x}(t_f), and consider the special case n=0, i.e. J_0 = V[\mathbf{x}(t_f)] = V[\mathbf{y}]. Then,

J_0 \, : &
\mathbf{x}(0) & \longmapsto & \mathbf{x}(t_f) &
\longmapsto & J_0[\mathbf{x}] \\
~ &
\mathbf{x}(0) & \longmapsto & L[\mathbf{x}(t_f-1)] &
\longmapsto & V[L[\mathbf{x}(t_f-1)]] \\

The composite nature of the mapping of J_0 is readily apparent:

J_0 & = \, V[\mathbf{x}(t_f)] \\
~ & = \, V[L[L[ \ldots L[ \mathbf{x}(0) ]]]] \\

A perturbation of the initial state \mathbf{x}(0) is linked to a perturbation of J_0 by applying the chain rule to (\ref{eqn:composition}):

\delta J_0 & = \, 
\frac{\partial V}{\partial \mathbf{x}(t_f)} \delta \mathbf{x}(t_f) \\
~ & = \, \frac{\partial V}{\partial \mathbf{x}(t_f)} \cdot
\frac{\partial \mathbf{x}(t_f)}{\partial \mathbf{x}(t_f-1)} \cdot \ldots \cdot
\frac{\partial \mathbf{x}(1)}{\partial \mathbf{x}(0)} \cdot \delta \mathbf{x}(0) \\

Recognizing that \delta J_0 is just the scalar product of \frac{\partial V}{\partial \mathbf{y}} and \delta \mathbf{y}, we can take advantage of the formal definition of an adjoint operator  < x \, | \, Ay> = < A^T x \, | \, y > to rewrite this equation:

\delta J_0 & = \, 
< \frac{\partial V}{\partial \mathbf{y}} \,\, | \,\,
\delta \mathbf{y} 
> \\
~ & = \, <
\frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\,
\frac{\partial \mathbf{x}(t_f)}{\partial \mathbf{x}(t_f-1)} \cdot \ldots \cdot
\frac{\partial \mathbf{x}(1)}{\partial \mathbf{x}(0)} \cdot \delta \mathbf{x}(0) 
> \\
~ & = \, <
\left[ \frac{\partial \mathbf{x}(1)}{\partial \mathbf{x}(0)} \right]^T
\cdot \ldots \cdot
\left[ \frac{\partial \mathbf{x}(t_f)}{\partial \mathbf{x}(t_f-1)} \right]^T
\frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\,
\delta \mathbf{x}(0) 
> \\
~ & = \, <
\frac{\partial V}{\partial \mathbf{x}(0)} \,\, | \,\,
\delta \mathbf{x}(0) 
> \\

or, in short, combining the tangent linear and adjoint matrices into \mathcal{TLM} and \mathcal{ADM}, respectively:

\delta J_0 & = \, <
\frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\,
\mathcal{TLM} \cdot  \delta \mathbf{x}(0) 
> \\

~ & = \, <
\mathcal{ADM} \cdot 
\frac{\partial V}{\partial \mathbf{x}(t_f)} \,\, | \,\,
\delta \mathbf{x}(0) 
> \\

Eqns. (\ref{eqn:lagrangetime}), (\ref{eqn:scalarprod}), and (\ref{eqn:shortscalar}) highlight several features:

  • the equivalence between Lagrange multipliers and the adjoint operator;
  • the TLM runs forward in time, propagating the effect of a perturbation \delta \mathbf{x}(0) to all model outputs, while the ADM runs backward, accumulating sensitivities of \delta J_0 to all model inputs;
  • the advantage of the ADM which provides the full gradient of a model-constrained objective function \frac{\partial V}{\partial \mathbf{x}(0)} over the TLM which only provides the projection of \frac{\partial V}{\partial \mathbf{y}} onto the perturbed vector \delta \mathbf{y} from initial perturbation \delta \mathbf{x}(0).
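The duality stated in the bullets above can be checked numerically on a toy chain of linear maps; the 2x2 matrices below are arbitrary stand-ins for the per-step Jacobians \partial \mathbf{x}(t+1)/\partial \mathbf{x}(t):

```python
# Numerical check of the duality  <g | TLM . dx>  =  <ADM . g | dx>
# for a two-step chain of linear maps. All matrices/vectors are invented.

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

A1 = [[1.0, 2.0], [0.5, -1.0]]   # Jacobian of step 1
A2 = [[0.0, 1.0], [3.0, 2.0]]    # Jacobian of step 2
g  = [1.0, -2.0]                 # dV/dx(tf)
dx = [0.3, 0.7]                  # initial perturbation

tlm = matvec(A2, matvec(A1, dx))                       # forward in "time"
adm = matvec(transpose(A1), matvec(transpose(A2), g))  # backward in "time"
print(dot(g, tlm), dot(adm, dx))   # both ≈ -8.55
```

Note that `adm` is the full gradient with respect to the initial state and is obtained in one backward pass, whereas `tlm` propagates only the single perturbation direction `dx`.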

Derivative code generation via automatic differentiation

In general, the complexity and effort involved in developing an adjoint model match those of developing the parent nonlinear model, which frequently prohibits adjoint model applications. An alternative to hand-coding the adjoint (i.e. coding the discretized adjoint equations), and a major step forward, is the use of automatic (or algorithmic) differentiation (AD) for derivative (e.g. ADM or TLM) code generation \citep{grie:00}. The advent of AD source-to-source transformation tools such as the commercial tool Transformation of Algorithms in Fortran (TAF) \citep{gier-etal:05} or the open-source tool OpenAD \citep{utke-etal:08} has enabled the development of exact adjoint models from complex, nonlinear forward models. In the oceanographic context, there is now a decade of experience in applying this method to a state-of-the-art ocean general circulation model (e.g. \cite{maro-etal:99,gala-etal:02,heim-etal:02,stam-etal:03,heim:08}), which encourages us to apply such tools to state-of-the-art ice sheet models.

The composite nature of the mapping J_0 is that of a large number of elementary operations; eqn. (\ref{eqn:mapping}) reflects this at the time-stepping level. Carrying the concept of composition to its extreme, ultimately each line of code can be viewed as an elementary operation. At the elementary level, the AD tool knows the complete set of derivatives (i.e. the elementary Jacobians) for each intrinsic arithmetic function (+, -, *, /, \sqrt{.}, \sin(.), etc.), as well as logical/conditional instructions.
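The idea that every elementary operation carries a known derivative, with the chain rule composing them line by line, can be sketched with operator overloading. This is a textbook illustration of forward-mode AD, not how TAF or OpenAD work internally (they are source-to-source transformation tools), but the underlying principle is the same:

```python
# Minimal forward-mode AD via "dual numbers": each elementary operation
# (+, *, sin, ...) carries its known derivative; the chain rule composes them.

import math

class Dual:
    """A value together with its derivative with respect to one input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)   # sum rule
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)  # product rule
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)  # d/dx sin = cos

# differentiate f(x) = x*sin(x) + 2*x at x = 1.0 in a single forward pass
x = Dual(1.0, 1.0)            # seed: dx/dx = 1
f = x * sin(x) + 2.0 * x
print(f.val, f.dot)           # f' = sin(x) + x*cos(x) + 2
```

Reverse mode, which the adjoint corresponds to, instead records the elementary operations and replays their transposed Jacobians backward, exactly as in the backward sweep of the Lagrange multiplier recursion above.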


  1. Heimbach, P. and V. Bugnion, 2009: Greenland ice sheet volume sensitivity to basal, surface, and initial conditions, derived from an adjoint model. Annals of Glaciology, 50(52), 67-80.
  2. Wunsch, C., 2006: Discrete Inverse and State Estimation Problems: With Geophysical Fluid Applications. Cambridge University Press.