Bayesian model calibration

Overview

Suppose we have a mathematical model describing the biological system of interest. The model may be a deterministic model, described by a system of coupled ordinary differential equations (ODEs), or a stochastic model, described by a continuous time Markov process with discrete state space. Note that, in what follows, we can treat deterministic models as special cases of stochastic models and so we refer to the stochastic case throughout. In general the model will depend on some unknown parameters/quantities, which we denote by U. These unknown quantities may be reaction rate constants and/or initial quantities of the chemical species in the model. The true underlying state of the system is the level of all (chemical) species in the system at any particular time and we denote this by Y. Therefore, given a particular value for U, the probability of observing Y is p(Y|U).

Model calibration involves determining values for U that are consistent with observed data, which we denote by Z. The data are usually noisy measurements of the true underlying state of the system (Y) and typically take the form of time-course measurements. The relationship between Y and Z is expressed in terms of the measurement error model p(Z|Y,V). Here V represents additional parameters that are in the measurement error model; for example, we might model the data Z by a Gaussian distribution with mean Y and variance V. Typically the model will be sufficiently complicated that it cannot be analysed mathematically: essentially the "likelihood" p(Y|U) cannot be computed directly due to partial observation of the time-course. However, we will assume that the model can be simulated (via forward simulation), that is, for given values of the unknown parameters/quantities (U), a realisation of the system can be forward simulated through time.

The standard method for calibrating the model by using forward simulation works as follows. Suppose the measurement error parameters V are known. If a value for U is chosen (say U*) and the model is simulated to generate a realisation of Y (call it Y*), we can then compare Y* to the observed data Z and if they are close (in terms of the measurement error distribution p(Z|Y*,V)) informally conclude that U* is consistent with the observed data. Clearly there won't be a single value of U that is consistent with the observed data (due to the measurement error), rather there will be a range of values with some of these values being more likely than others. It is straightforward to deal with the much more common scenario in which the value of V is not known and to simultaneously infer plausible values for both U and V based on the data.

The most natural way to proceed is to adopt a Bayesian approach to the calibration process. In this approach, uncertainties about likely values of U and V are expressed through a prior probability distribution p(U,V), and then a simulation-based calibration algorithm (described below) is repeated many times for different choices of U and V to give a sample of U* and V* values from the posterior probability distribution p(U,V|Z), all of which are consistent with the data.

Bayesian calibration via sequential Markov chain Monte Carlo

We use Markov chain Monte Carlo (MCMC) methods as the standard technique for sampling from Bayesian posterior distributions (except for situations where analytical methods can be used). The basic idea underlying MCMC methods is to construct an algorithm which is a Markov chain and has the posterior probability distribution as its stationary distribution. The Markov chain is then simulated and, after discarding some values to burn in the Markov chain (and possibly thinning the chain to reduce its autocorrelation), its values form a sample from the posterior distribution. Below we describe the algorithm in more detail and include an illustrative example.

Static MCMC scheme

Before describing the sequential algorithm, we show how the model can be calibrated against the whole dataset Z by using a more standard static MCMC scheme. The scheme produces realisations of the posterior distribution p(U,V,Y|Z) by constructing a Markov chain (over the unknown quantities U, V, Y) which has this posterior distribution as its stationary distribution. Each iteration of the static MCMC scheme proceeds as follows.

Suppose that the current sampled values of U, V and Y are denoted Uc, Vc and Yc, respectively. Then

  1. Propose candidate values of U and V, denoted Up and Vp from some appropriate proposal distributions.

  2. Obtain a realisation of Y, denoted Yp, by forward simulating the model conditional on unknown quantities Up.

  3. Compute the acceptance ratio A, which is the product of the ratio of prior densities for U and V (proposed divided by current), the ratio of measurement error probabilities and the inverse of the proposal ratio (for U and V). Note that all "likelihoods" involving Y or Y* cancel out of this ratio. This is essential for this method to work as we cannot directly compute "likelihoods" of the form p(Y|U). It is a form of "likelihood-free" MCMC.
  4. Accept (Up,Vp,Yp) as a sample from the posterior distribution p(U,V,Y|Z) with probability min(1,A) and set Uc <- Up, Vc <- Vp, Yc <- Yp. Otherwise, return Uc, Vc and Yc as a sample from the posterior.

This process is repeated many (thousands of) times to generate a sample of (U,V,Y) values from the joint posterior distribution p(U,V,Y|Z). Thus the sampled values of U are the (marginal) posterior distribution p(U|Z), that is, are values of the unknown quantities that are consistent with the observed data.

This calibration procedure is very general and can be applied to any model for which we can get realisations by using forward simulation. It works well in practice, particularly for deterministic models such as those based on ODEs. However, one issue that limits its applicability (particularly for stochastic models) is the generally high dimension of the observed data Z (and latent data Y). When there are many observations in the time-course, it is very unlikely that the simulated Y is close enough to the observed Z and this gives a very low value for the acceptance probability min(1,A). This is generally the case even when we know the "true" values of the unknown quantities U. Essentially this is a manifestation of the so-called curse of dimensionality. One way to overcome this problem is to split the observed data into smaller, more manageable batches and analyse each batch in turn, that is, sequentially. This strategy is made possible by the Markovian structure of the stochastic model.

Sequential MCMC scheme

First we partition the observed data Z into n (sequential) batches Z1, Z2, ..., Zn, where each batch consists of the data at at least one time point. The underlying state of the system Y can similarly be partitioned into batches Y1, Y2, ... ,Yn. Crucially the batches should not be too small nor too large. Smaller batches will result in larger acceptance probabilities (essentially there is a greater chance of simulating a Yi that is close enough to the corresponding Zi). However there is also a trade-off with the number of batches (n) as the time to run the algorithm scales with n. We then apply the static MCMC algorithm sequentially to each batch of data, using the posterior distribution after each batch is included as the prior distribution for inclusion of the next batch.

Suppose that we have already calibrated the model against the data from batches Z1,...,Zi-1, and therefore have a sample of N values of U, V, Yi-1 (called particles) from the "posterior" distribution p(U,V,Yi-1|Z1,...,Zi-1). Now suppose the current sampled values of U, V, Yi-1 and Yi are denoted by Uc, Vc, Yi-1c and Yic. Then a typical iteration of the sequential MCMC algorithm is as follows:

  1. Propose candidate values of U,V and Yi-1, denoted by Up, Vp and Yi-1p, from their "prior" distribution p(U,V,Yi-1|Z1,...,Zi-1). This is achieved by sampling (uniformly at random) one of the N sampled particles from p(U,V,Yi-1|Z1,...,Zi-1) which are then possibly jittered (see later).

  2. Obtain a realisation of Yi, denoted Yip, by forward simulating the model conditional on Yi-1p and Up.

  3. Compute the acceptance ratio, A=p(Zi|Yip,Vp)/p(Zi|Yic,Vc).

  4. Accept (Up,Vp,Yi-1p,Yip) as a sample from the "posterior" distribution p(U,V,Yi-1Yi|Z1,...,Zi-1,Zi) with probability min(1,A) and set Uc <- Up, Vc <- Vp, Yi-1c <- Yi-1p and Yic <- Yip. Otherwise, return Uc, Vc, Yi-1c and Yic as a sample from the "posterior" distribution.

Repeating this process many (thousands of) times will, after discarding some values to burn in the Markov chain and thinning the chain to reduce its autocorrelation, eventually generate a sample of N values of (U,V,Yi-1,Yi) from the "posterior" distribution p(U,V,Yi-1,Yi|Z1,...,Zi-1,Zi). Thus the sampled values of U,V,Yi are from the posterior distribution p(U,V,Yi|Z1,...,Zi) and reflect our uncertainty about these quantities after taking account of the data in batches Z1,...,Zi. These sampled values form our prior sample (particles) which we shall use to calibrate the model against the next batch of data. This process is repeated until all n batches of data have been included. The sampled values of U and V at the final stage will be samples from the full posterior distribution p(U,V|Z1,...,Zn)=p(U,V|Z), and so the U values will be values that are consistent with the observed data (allowing for uncertainty about V).

One problem with the above algorithm is that at all stages, the prior distribution is represented by a limited number of sample values and not by an analytical form. This can result in problems of particle degeneration, in which the number of distinct values in the prior distribution becomes too small to effectively represent the full prior distribution. However, this can be overcome by proposing candidate values for (U,V) from a kernel density estimate of the prior distribution. In practice this is straightforward and simply involves jittering a particle (from the prior) by purturbing it by a small amount of random (Gaussian) noise.

This approach of using sequential MCMC based on forward simulation from the model provides the most general and widely applicable approach to calibrating complex models and is at the heart of the computational framework employed by the CaliBayes project. An additional advantage of the sequential MCMC approach to calibration is that it deals naturally with multiple datasets obtained from different sources, perhaps pertaining to different aspects of the model.

Illustrative example

We illustrate the ideas by using a simple example. Suppose we have time course data on three of the five species from a simple model of prokaryotic gene regulation. The model is described by a system of eight equations, each with an unknown rate constant, and for simplicity, the initial values of the species Y (at time t=0) were assumed to be known. Therefore the unknown quantities are the eight rate constants U=(c1,c2,...,c8). We have chosen independent uniform prior distributions for the log of the rate constants. A Poisson measurement error model was chosen and so there are no additional measurement error parameters, that is, p(Z|Y,V)=p(Z|Y). The figure below shows the observed time-course data Z on the LHS with the highlighted points at time 0 being the known initial values for Y. On the RHS we have views of the prior distribution for two of the rate constants: the marginal distribution for log(c5) and the joint distribution for log(c5) and log(c6) - the blue dots are the "true values" for these rate constants, that is, the values that were used to simulate the time-course data on the LHS.

We now start the calibration procedure by calibrating the model using only the first batch of data Z1. Here this is simply the observed data at time t=1 - highlighted on the left hand plot below by the vertical dashed line. After running the algorithm we obtain a sample of values from the "posterior" distribution p(U,Y1|Z1).

pfexamscat1.png

The RHS plots show how our understanding about which values of c5 and c6 are reasonable has changed from their prior uniform shape. Just calibrating the model using the data at the first time point has already focused attention on a smaller region of plausible values for these rate constants and, for example, now gives much more weight to values of log(c5) between -4 to -1.

Of course what we really want is a sample from the "full" posterior distribution p(U|Z1,Z2,...,Zn) as this expresses our uncertainty about the unobserved quantities in the light of all the data (not just Z1). This is where the sequential aspect of the algorithm comes into play. First we use our sample of values from p(U,Y1|Z1) as our "prior" distribution for the unknown quantities, and then calibrate the model using the information in the second batch of data Z2. The algorithm then produces a sample which describes our uncertainty about the unknown quantities given the data in Z1 and Z2, that is, from the "posterior" distribution p(U,Y2|Z1,Z2) [shown below]. In turn these values are then used as the "prior" distribution when we calibrate the model to the next batch of data Z3. In our example, the equivalent plots after additionally incorporating the information in the data at time t=2 are

pfexamscat2.png

This procedure is repeated until all batches of data have been incorporated. The figures below show the evolution of the sequential algorithm after incorporating the data up until times t=10, t=20 and finally t=30 are

pfexamscat3.png

pfexamscat4.png

pfexamscat5.png

Clearly, even after using all the data, there remains a reasonable amount of uncertainty about appropriate values for the rate constants in the model. This is often the case and is due to both errors in measuring the data (described by the measurement error model) and the inherent uncertainty in the biological mechanism (described by the stochastic process). Notice also that there is a high level of correlation between c5 and c6, as there also is between all the unknown rate constants. This too is common. Although there still is a reasonable amount of uncertainty in the appropriate values of c5 and of c6, there are relatively few combinations of c5 and c6 that are plausible.

What if the model is slow to simulate?

The methods described above rely on being able to simulate many thousands (or millions) of realisations from the model and so these realisations need to be quick to produce if these methods are to be of practical use. When forward simulation from the model is not fast enough then one option is to replace the model by an approximate model for which forward simulation is fast. Another approach (which is popular in the analysis of deterministic computer models) is to emulate the model of interest. In the terminology of the computer model literature, the algorithm for producing forward simulations from the model is called the simulator. The emulator is constructed by fitting a flexible statistical model to the output from the simulator as a function of its input parameters U. Here realisations of the statistical model have to be fast to produce (and/or the model facilitates analytical analysis). At the simplest level, we can replace forward simulations from the model of interest (the simulator) by simulations from the emulator in the static MCMC and sequential MCMC schemes outlined above. If the emulator is analytically tractable then it is possible to construct more complicated (but therefore less generic) MCMC algorithms to sample from the posterior distribution of interest. For further details, see the section on stochastic_model_emulation.

References for further reading

CalibayesWiki: Bayesian_model_calibration (last edited 2009-04-28 09:53:40 by localhost)