Statistical finite elements for misspecified models

Significance. Science and engineering have benefited greatly from the ability of finite element methods (FEMs) to simulate nonlinear, time-dependent, complex systems. The recent advent of extensive data collection from such complex systems now raises the question of how to systematically incorporate these data into finite element models, consistently updating the solution when the mathematical model is misspecified with respect to physical reality. This article describes a general and widely applicable methodology for the coherent synthesis of data with FEM models, providing a data-driven probability distribution that captures all sources of uncertainty in the pairing of FEM with measurements.

We will now derive a Gaussian measure $\mu_0$ which serves as the prior reference measure. Typical analysis of the deterministic problem begins by looking for weak solutions to Eq. (1), multiplying by test functions $\psi \in H_0^1(\Omega)$ and assuming that $f \in L^2(\Omega)$. The space $H_0^1(\Omega)$ denotes the Sobolev space of functions with first-order weak derivatives in $L^2(\Omega)$ that vanish on $\partial\Omega$.

To start we show that if we assume $\xi_\theta \in L^2(\Omega)$ then we can define a Gaussian measure on $L^2(\Omega)$. Writing the $L^2$ inner product as $\langle f, g \rangle = \int_\Omega f(x) g(x)\, \mathrm{d}x$, understood as a Lebesgue integral, we can define a Gaussian measure $N(0, C_\theta)$ from the random field, where the covariance operator $C_\theta$ is defined from the kernel $k_\theta$ (see (1), Chapter 6), i.e.
$$(C_\theta \phi)(x) = \int_\Omega k_\theta(x, x')\, \phi(x')\, \mathrm{d}x',$$
so $\xi_\theta \sim N(0, C_\theta)$ on $L^2(\Omega)$. We now return to Eq. (1), multiply by test functions $\psi \in H_0^1(\Omega)$ (also assuming solutions $u \in H_0^1(\Omega)$), and integrate over the problem domain $\Omega$ to give the weak form: find $u \in H_0^1(\Omega)$ such that
$$A_\Lambda(u, \psi) = \langle f, \psi \rangle + \langle \xi_\theta, \psi \rangle \quad \text{for all } \psi \in H_0^1(\Omega).$$

The above can be viewed as an infinite system of equations with the matrix $A$ having entries $A_{ij} = A_\Lambda(\phi_i, \phi_j)$, and thus can be viewed as an operator on $\ell^2$ via $Au = \big( \sum_{i=1}^{\infty} u_i A_\Lambda(\phi_i, \phi_j) \big)_{j \in \mathbb{N}}$. We assume the $\ell^2$ structure for the following theorem.

Theorem 1. The operator $A \colon \ell^2 \to \ell^2$ is invertible.

Proof. If $A$ is not invertible then $Au = 0$ for some $u \neq 0$; we show that this leads to a contradiction.

The full covariance matrix $G(\theta)$ can be written as
$$G_{ij}(\theta) = \langle C_\theta \phi_i, \phi_j \rangle = \int_\Omega \int_\Omega \phi_i(x)\, k_\theta(x, x')\, \phi_j(x')\, \mathrm{d}x\, \mathrm{d}x'.$$
This covariance kernel can be chosen to encode information about the spatial variation of the process. For example, if the forcing is assumed to be smooth in space then the popular squared-exponential covariance kernel may be appropriate (we have used this covariance in all of our examples). There is a vast literature on covariance kernels; see (3), Chapter 4, for a thorough treatment.

Arrival of data $y$ with some measurement error process $\eta$ can be written as
$$y = Hu + \eta, \qquad \eta \sim N(0, \sigma^2 I),$$
where $H \colon \mathbb{R}^M \to \mathbb{R}^N$ is the now finite-dimensional linear observation operator. The finite-dimensional Bayes theorem gives the posterior distribution over the FEM coefficients, in which
$$m = m_u + C_u H^\top \left( H C_u H^\top + \sigma^2 I \right)^{-1} (y - H m_u),$$
$$C = C_u - C_u H^\top \left( H C_u H^\top + \sigma^2 I \right)^{-1} H C_u.$$
We note that throughout this paper this should really also be conditional on the differential operators and forcing functions that form the dynamics, but this conditioning is taken as implicit so as to avoid cumbersome notation.

Parameters $\theta$ are estimated from the log-marginal likelihood $p(y \mid \theta, \sigma, \Lambda) = N(H m_u, H C_u H^\top + \sigma^2 I)$, using either sampling or optimization based approaches depending on the need for uncertainty quantification.
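To make the conditioning and marginal likelihood computations concrete, the following is a minimal NumPy sketch. The prior quantities $m_u$, $C_u$, the observation operator $H$, and all numerical values here are illustrative placeholders rather than FEM-derived objects.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
M_dim, N_obs = 50, 10                     # FEM coefficients, observations
m_u = rng.normal(size=M_dim)              # placeholder prior mean
L_fac = rng.normal(size=(M_dim, M_dim))
C_u = L_fac @ L_fac.T + 1e-6 * np.eye(M_dim)   # placeholder prior covariance
H = np.zeros((N_obs, M_dim))              # observe a random subset of nodes
H[np.arange(N_obs), rng.choice(M_dim, size=N_obs, replace=False)] = 1.0
sigma = 0.1
y = H @ m_u + sigma * rng.normal(size=N_obs)   # synthetic data

# Posterior mean and covariance, following the update equations above.
S = H @ C_u @ H.T + sigma**2 * np.eye(N_obs)   # innovation covariance
K = np.linalg.solve(S, H @ C_u).T              # gain: C_u H^T S^{-1}
m = m_u + K @ (y - H @ m_u)
C = C_u - K @ H @ C_u

# Log-marginal likelihood p(y | theta, sigma, Lambda) = N(H m_u, S).
log_marginal = multivariate_normal(mean=H @ m_u, cov=S).logpdf(y)
```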

Now we consider the parabolic time-dependent problem.

For mathematical simplicity we take the separable covariance function
$$k\big((x, t), (x', t')\big) = \delta(t - t')\, k_\theta(x, x'), \qquad [2]$$
which has the implication that the stochastic forcing is white noise in time and spatially regular as per $k_\theta(\cdot, \cdot)$. We start by making a spatial discretization of the above with finite elements to give the semidiscrete problem, and then use finite difference methods to give the fully discrete problem (see e.g. (4)). We start with the very similar weak form (with $u \in U$, $\psi \in V$), expanding solutions as $u_h(x, t) = \sum_{i \in I} u_{h,i}(t)\, \phi_i(x)$, to give the system of ODEs
$$M \dot{u}(t) + A u(t) = b + \xi^t, \qquad \xi^t_j = \langle \xi_\theta, \psi_j \rangle.$$
The covariance of $\langle \xi_\theta, \psi_j \rangle$ is defined by
$$\mathbb{E}\big[\langle \xi_\theta, \psi_i \rangle(t)\, \langle \xi_\theta, \psi_j \rangle(t')\big] = \delta(t - t')\, G_{ij}(\theta),$$
which implies that the Gaussian process $\xi^t = (\langle \xi, \psi_1 \rangle, \langle \xi, \psi_2 \rangle, \ldots)^\top$ is white in time with covariance $G(\theta)$. This $\xi^t$ can be informally thought of as the derivative of a Brownian motion process $\beta_t$ with diffusion matrix $G$ (5). We can concatenate the FEM coefficients into a vector $u = (u_1(t), u_2(t), \ldots)^\top$ to write the above as a vector SDE:
$$M\, \mathrm{d}u = (b - Au)\, \mathrm{d}t + \mathrm{d}\beta_t,$$
where $M_{ij} = \langle \phi_i, \psi_j \rangle$ (the mass matrix) and $A$ and $b$ are as above. Next, making a time discretization $u^n = (u_1(n\Delta t), u_2(n\Delta t), \ldots)^\top$ and using an explicit Euler discretization, one can write
$$M u^{n+1} = M u^n + \Delta t\, (b - A u^n) + \xi^n, \qquad \xi^n \sim N(0, \Delta t\, G(\theta)).$$
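As a concrete illustration of the explicit Euler (Euler-Maruyama) step, the sketch below integrates a vector SDE of the form above. The 1D mesh, lumped mass matrix, Laplacian stencil, and kernel parameters are all illustrative assumptions, not the discretization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dof, dt = 64, 1e-3
x = np.linspace(0.0, 1.0, n_dof)
h = x[1] - x[0]

# Placeholder FEM matrices: lumped mass M, stiffness A, load vector b.
M_mat = h * np.eye(n_dof)
A = (2.0 * np.eye(n_dof) - np.eye(n_dof, k=1) - np.eye(n_dof, k=-1)) / h
b = h * np.ones(n_dof)

# Forcing covariance G(theta): squared-exponential Gram matrix at the nodes.
tau, ell = 1.0, 0.1
G = tau**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
G_chol = np.linalg.cholesky(G + 1e-10 * np.eye(n_dof))

def euler_maruyama_step(u):
    """One step of M u_{n+1} = M u_n + dt (b - A u_n) + xi_n,
    with xi_n ~ N(0, dt G(theta))."""
    xi = np.sqrt(dt) * (G_chol @ rng.normal(size=n_dof))
    return np.linalg.solve(M_mat, M_mat @ u + dt * (b - A @ u) + xi)

u = np.sin(np.pi * x)     # illustrative initial condition
for _ in range(100):
    u = euler_maruyama_step(u)
```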

The data arrive at each timestep in the form $y_n = H_n u^n + \eta_n$, with Gaussian additive noise $\eta_n \sim N(0, \sigma_n^2 I)$.

This is a high-dimensional linear Gaussian state space problem. This class of models has been well studied (6), and one can apply standard Kalman filtering methods to obtain the filtering distribution $p(u^n \mid y_{1:n}, \theta_{1:n}, \sigma_{1:n}, \Lambda)$. We denote by $m_{n|n-1}$ the posterior mean at time $n$ conditional on data up to and including time $n-1$; the covariance follows the same notation.
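One step of the resulting Kalman recursion can be sketched as follows, writing the Euler update in the standard linear Gaussian state space form $u^n = F u^{n-1} + c + w_n$, $w_n \sim N(0, Q)$. This is a generic textbook step with all matrices supplied by the caller, not the paper's implementation.

```python
import numpy as np

def kalman_step(m_prev, C_prev, y_n, F, c, Q, H, sigma_n):
    """One predict/update step for u^n = F u^{n-1} + c + w_n, w_n ~ N(0, Q).
    For the explicit Euler scheme above, F = I - dt M^{-1} A, c = dt M^{-1} b,
    and Q = dt M^{-1} G(theta) M^{-T}."""
    # Prediction: p(u^n | y_{1:n-1}) = N(m_pred, C_pred).
    m_pred = F @ m_prev + c
    C_pred = F @ C_prev @ F.T + Q
    # Update with y_n = H u^n + eta_n, eta_n ~ N(0, sigma_n^2 I).
    S = H @ C_pred @ H.T + sigma_n**2 * np.eye(H.shape[0])
    K = np.linalg.solve(S, H @ C_pred).T     # Kalman gain
    m_filt = m_pred + K @ (y_n - H @ m_pred)
    C_filt = C_pred - K @ H @ C_pred
    return m_filt, C_filt
```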

A general nonlinear PDE with stochastic forcing can be expressed as
$$\partial_t u + \mathcal{L}_\Lambda u + \mathcal{F}_\Lambda(u) = f + \xi_\theta.$$
The nonlinear statFEM construction is then as follows, assuming the separable covariance structure as in Eq. (2). As previously, we start with the spatial discretization to give a semidiscrete problem (a vector SDE) and then proceed to the fully discrete solution via finite differences in time. We begin by multiplying by test functions $\psi \in V$ and integrating over the problem domain. We next divide the domain $\Omega$ into finite elements on a given mesh and look for solutions in terms of a finite set of trial functions $\{\phi_i\}_{i \in I}$ against test functions $\{\psi_i\}_{i \in I}$ as before. We expand solutions in terms of these basis functions to give the updated weak form, which now defines a nonlinear, coupled system of stochastic differential equations. In general one can make no comment on the distributional form of the resultant probability measure on function space, due to the nonlinear $\mathcal{F}_\Lambda$. Following the derivations given in the linear case above, we can write this system as a nonlinear vector SDE:
$$M\, \mathrm{d}u = \mathcal{F}_\Lambda(u)\, \mathrm{d}t + \mathrm{d}\beta_t,$$
where $M$ is the mass matrix and $\mathcal{F}_\Lambda$ is some nonlinear vector function that encodes the action of $\mathcal{L}_\Lambda$ and $\mathcal{F}_\Lambda$ in terms of the FEM coefficients; however, for brevity we will just deal with the above (or variants thereof). Statisticians will note that from here on in we are essentially studying a high-dimensional nonlinear state-space model. We now turn to describing the data assimilation procedure, in which:

• $y_n \in \mathbb{R}^N$ is the observed data at time $n\Delta t$.

• $H_n \colon \mathbb{R}^M \to \mathbb{R}^N$ is the observation operator that maps from the FEM solution mesh to the observed points $x_{\mathrm{obs}}$.

• $u^n \in \mathbb{R}^M$ is the vector of FEM coefficients at time $n$.

• $\eta_n$ is a noise process that represents the measurement error for the observed values. This is Gaussian, $\eta_n \sim N(0, \sigma_n^2 I)$, and thus so is the likelihood $p(y_n \mid u^n, \sigma_n, \Lambda) = N(H_n u^n, \sigma_n^2 I)$.
A. Extended Kalman filter (EKF). At time $n$ we start with the previous filtering distribution $p(u^{n-1} \mid y_{1:n-1}, \theta_{1:n-1}, \sigma_{1:n-1}, \Lambda) = N(m_{n-1|n-1}, C_{n-1|n-1})$; then proceed as follows (a sketch of this pattern in code is given after this list):

1. Compute the prediction step for the mean by solving for $m_{n|n-1}$, and compute the so-called forward covariance
$$\hat{C}_{n|n-1} = (J^n)^{-1} J^{n-1} C_{n-1|n-1} (J^{n-1})^\top (J^n)^{-\top},$$
where $J^n$ and $J^{n-1}$ are the Jacobians of the discretized system. The reason we do this "half" prediction step is that the parameters $\theta_n$ are as yet unknown and must be estimated.
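The sketch below illustrates this estimate-then-predict pattern, assuming (for simplicity) an explicit one-step map `f` with Jacobian `jac_f` and a two-parameter forcing covariance `G_of_theta`; these names, the parameterization, the initial guess, and the explicit covariance propagation are illustrative assumptions rather than the paper's (possibly implicit) scheme.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def ekf_step_with_estimation(m_prev, C_prev, y_n, f, jac_f, G_of_theta, H):
    """Half prediction -> estimate (theta_n, sigma_n) -> full prediction and
    update, assuming an explicit one-step map f."""
    m_pred = f(m_prev)                    # half prediction: no forcing yet
    J = jac_f(m_prev)
    C_half = J @ C_prev @ J.T             # forward covariance, without G(theta)
    N_obs = H.shape[0]

    def neg_log_lik(params):
        *theta, log_sigma = params
        S = (H @ (C_half + G_of_theta(theta)) @ H.T
             + np.exp(2.0 * log_sigma) * np.eye(N_obs))
        return -multivariate_normal(mean=H @ m_pred, cov=S).logpdf(y_n)

    # Illustrative: two kernel parameters (tau, ell) plus log sigma.
    res = minimize(neg_log_lik, x0=np.array([1.0, 1.0, 0.0]), method="L-BFGS-B")
    *theta_n, log_sigma_n = res.x
    sigma_n = np.exp(log_sigma_n)

    # Full prediction and analysis with the estimated parameters.
    C_pred = C_half + G_of_theta(theta_n)
    S = H @ C_pred @ H.T + sigma_n**2 * np.eye(N_obs)
    K = np.linalg.solve(S, H @ C_pred).T
    m_filt = m_pred + K @ (y_n - H @ m_pred)
    C_filt = C_pred - K @ H @ C_pred
    return m_filt, C_filt, np.asarray(theta_n), sigma_n
```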

B. Ensemble Kalman filter (EnKF). At time $n$ we start with the previous filtering distribution, which, with $N_{\mathrm{ens}}$ ensemble members, is described by the ensemble $\{u^{n-1,[i]}\}_{i=1}^{N_{\mathrm{ens}}}$. Then proceed as follows:

1. Compute the prediction step (without simulating the stochastic forcing) for each $i = 1, \ldots, N_{\mathrm{ens}}$, and compute the so-called forward covariance $\hat{C}_{n|n-1}$ from the resulting ensemble $\{\hat{u}^{n,[i]}\}_{i=1}^{N_{\mathrm{ens}}}$. The reason we do this "half" prediction step is that the parameters $\theta_n$ are as yet unknown and must be estimated.

2. Estimate $(\theta_n, \sigma_n)$; the full prediction covariance is approximated as $\hat{C}_{n|n-1} + \hat{G}(\theta_n)$, where we employ this approximation in order to use analytical gradients in the optimization step.

3. Using $\theta_n$, compute the full prediction step and condition on $y_n$ to give the posterior ensemble $\{u^{n,[i]}\}_{i=1}^{N_{\mathrm{ens}}} \sim p(u^n \mid y_{1:n}, \theta_n, \sigma_n, \Lambda)$.
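The following is a minimal perturbed-observation EnKF analysis step, written for the case where $\theta_n$ and $\sigma_n$ have already been estimated. The names `step_fn`, the ensemble layout, and the Cholesky factor `G_chol` of $G(\theta_n)$ are illustrative assumptions.

```python
import numpy as np

def enkf_step(ensemble, y_n, step_fn, G_chol, H, sigma_n, rng):
    """One EnKF step: full prediction with simulated forcing, then a
    perturbed-observation analysis update. ensemble has shape (N_ens, M)."""
    N_ens, M = ensemble.shape
    N_obs = H.shape[0]
    # Full prediction: deterministic step plus simulated stochastic forcing.
    pred = np.array([step_fn(u) for u in ensemble])
    pred = pred + rng.normal(size=(N_ens, M)) @ G_chol.T
    # Ensemble mean and scaled anomalies (so C_pred ~= Anom.T @ Anom).
    mean = pred.mean(axis=0)
    Anom = (pred - mean) / np.sqrt(N_ens - 1)
    HA = H @ Anom.T                                   # (N_obs, N_ens)
    S = HA @ HA.T + sigma_n**2 * np.eye(N_obs)        # innovation covariance
    K = Anom.T @ np.linalg.solve(S, HA).T             # gain: C_pred H^T S^{-1}
    # Perturbed observations give the correct posterior spread.
    y_pert = y_n + sigma_n * rng.normal(size=(N_ens, N_obs))
    return pred + (y_pert - pred @ H.T) @ K.T
```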

C. Discussion of the method. Discretization of the covariance $G(\theta)$ requires some care to implement, as this is a $2d$-dimensional integral and in general does not give a sparse matrix, since $C_\theta$ is a positive definite integral operator on Hilbert space. For large problems it may be necessary to impose some sort of sparsity constraint. This can be done, e.g., by assuming space-time white noise, in which case the mass matrix $M$ serves as the covariance matrix. Localization (see, e.g., (7)) may also be used to enforce a sparsity constraint and remove spurious correlations that may arise from small ensemble sizes (if the ensemble method is used). Setting entries below some threshold value to zero by brute force could also be used.

$$\tau_n \sim N_+(1, 1^2), \qquad \ell_n \sim N_+(1, 1^2). \qquad [4]$$

We choose to introduce prior information in order to regularize the optimization problem. Anecdotally, appropriate choices of these priors make the optimization much better behaved.

If parameters are assumed to be constant across time then the updating procedures provided above can be modified to account for this. One possibility is outlined in (6) and is based on optimizing the full likelihood
$$p(y_{1:N} \mid \theta, \sigma, \Lambda) = \prod_{n=1}^{N} p(y_n \mid y_{1:n-1}, \theta, \sigma, \Lambda),$$
running the filter forwards in time to evaluate this likelihood and analytically calculating gradients on the way through. Priors can also be incorporated.
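A minimal sketch of evaluating this full likelihood via the prediction error decomposition is given below, for the linear Gaussian case with time-constant parameters; the model matrices and the parameterization are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def neg_full_log_lik(params, ys, m0, C0, F, c, H, G_of_theta):
    """Negative log of prod_n p(y_n | y_{1:n-1}, theta, sigma), accumulated
    while running the Kalman filter forwards in time."""
    *theta, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)
    Q = G_of_theta(theta)
    m, C, total = m0, C0, 0.0
    for y_n in ys:
        m_pred = F @ m + c
        C_pred = F @ C @ F.T + Q
        S = H @ C_pred @ H.T + sigma2 * np.eye(H.shape[0])
        total += multivariate_normal(mean=H @ m_pred, cov=S).logpdf(y_n)
        K = np.linalg.solve(S, H @ C_pred).T
        m = m_pred + K @ (y_n - H @ m_pred)
        C = C_pred - K @ H @ C_pred
    return -total

# e.g. minimize(neg_full_log_lik, x0, args=(ys, m0, C0, F, c, H, G_of_theta),
#               method="L-BFGS-B")
```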

Optimization can then proceed via standard methods (e.g. L-BFGS-B, as we use for the marginal likelihood), noting that each likelihood (and gradient) evaluation requires computing the filtered distribution $p(u^N \mid y_{1:N}, \theta, \sigma)$ for fixed parameters $\theta$ and $\sigma$. We now illustrate the above using the extended Korteweg-de Vries (eKdV) equation.

statFEM for KdV
For the deterministic problem, we discretize using the scheme outlined in (10). Start by defining the time grid as $n_t$ evenly spaced points $(0, \Delta t, 2\Delta t, \ldots, (n_t - 1)\Delta t)$, with $u^n(x) = u(x, n\Delta t)$. We use Crank-Nicolson for time integration and continue as usual by multiplying with test functions $\psi \in V_0$ ($V_0$ some appropriate function space) and integrating with respect to $x$ over $[0, L]$ to give the weak form. The covariance of the stochastic forcing is $\delta(t - t')\, k(x, x')$, with $k$ a squared-exponential kernel. To minimize the order of derivatives we split the above into a system of three first-order equations; this is necessary as we use first-order (P1) basis functions. To extend the data we use a linear shift, which requires estimating $a_n, b_n$ at each iteration. This gives the best least-squares linear projection from the current prediction to the data, yielding a predicted dataset $\tilde{y}_n$ via the linear shift
$$\tilde{y}_n = a_n + b_n u^n.$$

Using this method we can extend our dataset to be as large as our FEM solution grid. In the paper we project to a grid of 100 points, uniformly spaced across the solution grid. This is only for the parameter estimation step: we do not use these values as the data in the analysis/conditioning step of the Kalman filter; we use them only to estimate the parameters, improving the conditioning of the problem. We use the same weakly informative priors $\tau_n \sim N_+(1, 1^2)$, $\ell_n \sim N_+(1, 1^2)$.
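A minimal sketch of the linear shift described above is as follows; the function name is illustrative, and the least-squares fit is over the scalar coefficients $a_n, b_n$ only.

```python
import numpy as np

def linear_shift(u_pred, y_n, H):
    """Fit y_n ~ a_n + b_n (H u_pred) by least squares, then extend the data
    to the full solution grid via y_tilde_n = a_n + b_n u_pred."""
    pred_at_obs = H @ u_pred
    X = np.column_stack([np.ones_like(pred_at_obs), pred_at_obs])
    (a_n, b_n), *_ = np.linalg.lstsq(X, y_n, rcond=None)
    return a_n + b_n * u_pred
```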

Further examples
In this section we demonstrate the application of the method on a set of PDEs. We use two well-known systems: the Kuramoto-Sivashinsky (KS) equation (Eq. (5)) and the two-dimensional Burgers' equation (Eq. (6)). We discretize using P1 trial and test functions with 512 elements in space, and an implicit Euler timestepping scheme ($\Delta t = 0.02$).
To deal with the fourth-order system we split it into a system of coupled PDEs (similar to KdV) to give the semi-discrete (time-discretized) equations. We generate data with an under-damped model, with $\nu = 0.95$. StatFEM conditions on 52 observations per timestep, which have simulated Gaussian error $\eta_n \sim N(0, 0.05^2)$. For the statFEM model, we assume the standard base model in which mismatch is induced by different dissipation magnitudes (0.95 vs. 1). As before, we set the covariance of $\xi_\theta$ to $\delta(t - t')\, k_\theta(x, x')$, with $k_\theta$ given by a squared-exponential covariance function, now with parameters $\theta_n = (\tau_n, \ell_n)$ and $\sigma_n$ estimated at each timestep $n$. The data generating process is given by $y_n = H u^n + \eta_n$; the data are generated according to the KS model plus measurement error. The initial conditions are set by running the data-generating process (i.e. Eq. (5) for $\nu = 0.95$), initialized with $u(x, 0) = \sin(x/16)$, for 2000 timesteps to skip over transient behaviour. In this case, to compute the covariance matrix $G(\theta_n)$ we use the approximation $G(\theta_n) = M K(\theta_n) M$, where $K(\theta_n)$ is the Gram matrix of $k_{\theta_n}$ evaluated at the mesh nodes.
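A minimal sketch of this mass-matrix approximation, on an illustrative uniform 1D mesh with a lumped P1 mass matrix (an assumption made here for brevity), is:

```python
import numpy as np

def assemble_G(x_nodes, h, tau, ell):
    """G(theta) = M K(theta) M with a lumped P1 mass matrix M = h I and a
    squared-exponential Gram matrix K at the mesh nodes."""
    sq_dists = (x_nodes[:, None] - x_nodes[None, :])**2
    K = tau**2 * np.exp(-0.5 * sq_dists / ell**2)
    M = h * np.eye(len(x_nodes))
    return M @ K @ M

x = np.linspace(0.0, 32.0, 512)           # illustrative uniform mesh
G = assemble_G(x, x[1] - x[0], tau=1.0, ell=1.0)
```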

To compute the filtering distribution $p(u^n \mid y_{1:n}, \theta_{1:n}, \sigma_{1:n}, \Lambda)$ we use a modified version of the EKF algorithm, with the priors given in Eq. (4). This modification makes use of the eigendecomposition of the covariance matrices $C_{n|n}$ and $G(\theta_n)$, and is called the low-rank EKF in (15). For the cost of a truncated eigendecomposition this avoids inverting the Jacobian many times. We use this approximation when computing the prediction covariances $\hat{C}_{n|n-1}$ and $\hat{G}(\theta_n)$, which are $(J^n)^{-1} J^{n-1} C_{n-1|n-1} (J^{n-1})^\top (J^n)^{-\top}$ and $(J^n)^{-1} G(\theta_n) (J^n)^{-\top}$ respectively; taking a truncated eigendecomposition reduces these to products of $(J^n)^{-1}$ with a small number of eigenvectors. Results are shown in Figure S2.

Burgers' equation is often taken as a stepping-stone to the full simulation of the Navier-Stokes equations, as it retains nonlinearity and viscous effects but is able to be solved in 1D (where the full Navier-Stokes requires at least 2D). In this subsection we study the 2D Burgers' equation, given by
$$\partial_t u + u\, \partial_x u + v\, \partial_y u = \frac{1}{\mathrm{Re}} \nabla^2 u, \qquad \partial_t v + u\, \partial_x v + v\, \partial_y v = \frac{1}{\mathrm{Re}} \nabla^2 v, \qquad [6]$$
with $u := u(x, y, t)$, $v := v(x, y, t)$, for $(x, y) \in [0, 2] \times [0, 2]$ and $t \in [0, 5]$, subject to periodic boundary conditions
$$u(0, y, t) = u(2, y, t), \quad u(x, 0, t) = u(x, 2, t), \quad v(0, y, t) = v(2, y, t), \quad v(x, 0, t) = v(x, 2, t),$$
and initial conditions $u(x, y, 0) = v(x, y, 0) = \sin(\pi(x + y))$.
Data $y_n$ are generated by solving Eq. (6) with Reynolds number $\mathrm{Re} = 150$. The statFEM conditions on 102 observations per timestep, which are jittered with simulated observational error $\eta_n \sim N(0, 0.01^2)$. Data are observed on the velocity field $u$ only, and the observation locations are shown on the mesh in Figure S3a. Note that in application more careful design of experiments may be necessary in order to choose these observation locations, especially for complex domains and/or more nonlinear regimes. Due to damping, noise becomes more apparent as the simulation runs. The base model is taken to be Eq. (6) with the added unknown forcing on the evolution equation for $u$ only. Mismatch is induced by setting $\mathrm{Re} = 100$, which in practice may occur because of discrepancy between measurements and modelling assumptions. The covariance of $\xi_\theta$ is given by $\delta(t - t')\, k_\theta(x, x')$, with $k_\theta$ given by a squared-exponential covariance function with parameters $\theta_n = (\tau_n, \ell_n)$. In this example we set $\ell_n = 1$ to avoid recomputing $G(\theta_n)$ in each iteration; $G(\theta_n)$ is computed in the same way as in the KS example (i.e. using the mass matrix, $G(\theta_n) = M K(\theta_n) M$).

The assumed data generating process is $y_n = H u^n + \eta_n$; that is, data are assumed to be generated according to the Burgers equation plus some measurement error.

In implementation we compute the posterior using a regular mesh with $64 \times 64$ elements in space (shown in Figure S3a) and timestep $\Delta t = 0.01$. The $L^2$ discretization error for the initial conditions is shown in Figure S3b.

Results are shown in Figure S4. Note that in this section we only plot the $u$ component, as this is the field on which the data are observed. The data generating process (Figure S4a) and posterior means (Figure S4b) are shown. The posterior surfaces shown in Figure S4d include the posterior means and the observed data points; the posterior mean shows agreement with the observed data. The normalized relative $L^2$ error $\|\bar{u} - u_{\mathrm{DGP}}\|_2 / \|u_{\mathrm{DGP}}\|_2$ is also plotted in Figure S4e, for the prior mean and the posterior mean (this is approximated using the FEM coefficients). The errors initially grow rapidly and then stabilise with time. The rapid initial increase in error is the same for the prior and posterior; after the filtering warm-up period, in which the noise is also overestimated (see Figure S4f), the posterior error becomes lower than the prior error and appears to increase at a lower rate. Increasing error is to be expected as we are not updating the model coefficients, only the numerical solution. Hence there will always be systematic differences between the data generating process and the posterior, which conditioning on data can only partly account for. We conjecture this results in consistently increasing errors.

Possible future work could investigate ways to estimate coefficients during the filtering procedure, thus eliminating this source of misspecification.

Finally, the parameter estimates are shown in Figure S4f. The true value of the noise is shown as a dashed orange band.

After some warm-up time the parameters reach stable estimates, with $\sigma_n$ being slightly overestimated from the data.

Figure S4(f) caption: Burgers parameter estimates. The dashed orange line shows the true value of the noise, which appears slightly overestimated at each time. We conjecture this may be due to identifiability problems with the stochastic forcing magnitude.