- $P(\vec{\theta}|\vec{x})$ is the probability distribution of the parameters $\vec{\theta}$ conditional on the data $\vec{x}$. This is called the posterior and is what we want to derive in Bayesian model inversion.
- $P(\vec{x}|\vec{\theta})$ is the probability distribution of the data $\vec{x}$ conditional on the parameters $\vec{\theta}$. This is called the likelihood. It is defined by our forward model.
- $P(\vec{\theta})$ is the probability distribution of the parameters independent of the data. This is called the prior and should reflect our belief about the parameters before seeing the data. To avoid biasing the analysis, one often uses an uninformative (i.e., very broad) prior distribution. Common choices are $P(\vec{\theta}) = 1$ or $P(\vec{\theta}) = 1/\vec{\theta}$.
- $P(\vec{x})$ is the probability of getting the data integrated over all possible parameters. It is called the marginal likelihood or evidence and reflects how likely one is to get the data given the forward model.
As we will see below, the likelihood and prior are usually given by the user based on their forward model for the data.
So, we can compute the marginal likelihood as the integral of the likelihood multiplied by the prior over all possible parameters.
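Written out, this marginal likelihood is the normalising integral of Bayes' theorem:

$$P(\vec{x}) = \int P(\vec{x}|\vec{\theta})\, P(\vec{\theta})\, d\vec{\theta}$$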
Once we have computed this marginal likelihood (which is just a scalar given the observed data) we can divide the likelihood and prior by this number to get the posterior distribution.
Unfortunately, this requires integrating over all possible parameters, which is impractical for all but the simplest cases.
So, we need an algorithm that allows us to estimate properties of the posterior without having to compute the marginal likelihood, namely Markov chain Monte Carlo (MCMC).
### Markov chain Monte Carlo (MCMC)
#### Metropolis-Hastings MCMC algorithm
In MCMC we generate samples from the posterior distribution without explicitly having to know what the posterior distribution looks like.
We do this by following a random walk through parameter space using the following algorithm:
- given the location of the random walk in parameter space $\vec{\theta}_n$ at step $n$, generate candidate parameters from a proposal distribution: $\vec{\theta}_c \sim P_{\rm prop}(\vec{\theta}_c|\vec{\theta}_n)$. A common choice for the proposal distribution is a Gaussian centered on $\vec{\theta}_n$.
- Draw a random number $u$ uniformly between 0 and 1. If this number is *smaller* than the posterior ratio ($P(\vec{\theta}_c|\vec{x})/P(\vec{\theta}_n|\vec{x})$), accept the candidate point; otherwise keep $\vec{\theta}_n$. This in effect gives an acceptance probability of $P_{\rm accept} = \min\left(1, P(\vec{\theta}_c|\vec{x})/P(\vec{\theta}_n|\vec{x})\right)$. Note that the marginal likelihood cancels in this ratio, so we never need to compute it.
- In practice we will nearly always work with log-probability rather than probability functions.
So rather than computing the posterior ratio as `ratio = np.prod(np.exp(-(x - candidate) ** 2 / 2)) / np.prod(np.exp(-(x - current_mu) ** 2 / 2))`,
we would compute it as `ratio = np.exp(0.5 * (np.sum(-(x - candidate) ** 2) - np.sum(-(x - current_mu) ** 2)))`.
This has the advantage that the product in the first case can become extremely small, which can lead to numerical instabilities in the calculations. We will see this in action below.
- The acceptance probability is given by the posterior ratio only for a *symmetric* proposal function.
For asymmetric proposal functions (i.e., $P_{\rm prop}(\vec{\theta}_c|\vec{\theta}_n) \ne P_{\rm prop}(\vec{\theta}_n|\vec{\theta}_c)$), the acceptance probability also depends on the ratio of the proposal probabilities (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm). Here we will only investigate symmetric proposal distributions.
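The numerical advantage of working with log-probabilities can be seen in a small sketch; the data and the two candidate means below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=2000)  # synthetic data, for illustration only
candidate, current_mu = 4.9, 5.1

# naive ratio: each factor exp(-(x - mu)**2 / 2) is tiny, so both products underflow to 0.0
naive = np.prod(np.exp(-(x - candidate) ** 2 / 2)) / np.prod(np.exp(-(x - current_mu) ** 2 / 2))

# log-space ratio: sum the log-probabilities first, exponentiate the difference once
stable = np.exp(0.5 * (np.sum(-(x - candidate) ** 2) - np.sum(-(x - current_mu) ** 2)))

print(naive)   # nan: 0.0 / 0.0 because both products underflowed
print(stable)  # a finite, usable ratio
```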
#### Multi-parameter model example
First we define a generic MCMC algorithm. We assume the user provides some function $f$ which gives the sum
of the log-likelihood and log-prior given some set of parameters.
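Such a sampler might be sketched as follows; the function name `mcmc`, the Gaussian proposal, and the argument names are illustrative choices, not fixed by the text:

```python
import numpy as np

def mcmc(f, start, step_size=0.1, n_steps=10000, seed=None):
    """Metropolis sampler with a symmetric Gaussian random-walk proposal.

    f         : user-supplied function returning log-likelihood + log-prior
    start     : starting position in parameter space (1D array-like)
    step_size : standard deviation of the Gaussian proposal
    """
    rng = np.random.default_rng(seed)
    current = np.asarray(start, dtype=float)
    current_logp = f(current)
    samples = np.empty((n_steps, current.size))
    for n in range(n_steps):
        candidate = current + rng.normal(scale=step_size, size=current.size)
        candidate_logp = f(candidate)
        # accept with probability min(1, exp(candidate_logp - current_logp))
        if np.log(rng.uniform()) < candidate_logp - current_logp:
            current, current_logp = candidate, candidate_logp
        samples[n] = current
    return samples
```

For example, `mcmc(lambda t: -0.5 * np.sum(t ** 2), start=[0.0], step_size=1.0)` would draw samples from a standard normal.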
```python
print(f'Expectation value of mu (E(mu) = {np.mean(samples[:,0])}) matches the sample mean ({np.mean(x)})')
```
Here we see that we can estimate the probability of the mean being bigger than zero by the fraction of samples bigger than zero. Similarly, we can estimate the mean of the distribution by computing the mean of the samples.
Note that the precision of these estimates is limited by the number of *independent* samples.
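As a minimal illustration of this kind of Monte Carlo estimation, using synthetic draws in place of real MCMC output:

```python
import numpy as np

# synthetic stand-in for the posterior draws of mu (assumed for illustration)
rng = np.random.default_rng(42)
mu_samples = rng.normal(loc=0.3, scale=0.2, size=5000)

p_positive = np.mean(mu_samples > 0)  # P(mu > 0): fraction of samples above zero
mu_estimate = np.mean(mu_samples)     # posterior mean of mu
print(p_positive, mu_estimate)
```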
### Automatically adjusting MCMC step size
The efficiency of the MCMC algorithm will depend greatly on how closely the proposal function matches the target posterior function.
The closer they match, the more efficient the algorithm.
This can be best visualised using the trace:
```python
import matplotlib.pyplot as plt
import numpy as np

# use the same model/data as the previous section
np.random.seed(1)  # set seed to get consistent results
fig, axes = plt.subplots(3, 1, sharex=True)
# `traces` is assumed to map each step-size label to the samples from an MCMC run
for ax, (label, samples) in zip(axes, traces.items()):
    ax.plot(samples[:, 0])  # the trace is simply a line plot showing the evolution of a parameter during the MCMC
    ax.set_ylabel(r'$\mu$')
    ax.set_title(f'step size is {label}')
axes[-1].set_xlabel('step time (t)')
```
We want subsequent samples to look independent from each other (i.e., middle panel) rather than varying very slowly (indicative of too small a step size; top panel) or getting stuck (indicative of too large a step size; bottom panel). Note that irrespective of step size, the distribution of MCMC samples will converge to the posterior distribution. With the wrong step size it just might take much, *much* longer...
A nice way to ensure the step size is correct (and deal with correlated parameters) is to adjust the step size automatically.
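One simple scheme (an illustrative sketch, not the only option) is to monitor the acceptance rate during a burn-in phase and nudge the step size toward a target rate; the function name, target rate, and batch size below are all assumptions:

```python
import numpy as np

def tune_step_size(f, start, step_size=0.1, target_accept=0.3,
                   n_tune=2000, batch=100):
    """Crude burn-in tuner: every `batch` steps, grow the step size if we
    accept too often, shrink it if we accept too rarely."""
    rng = np.random.default_rng(0)
    current = np.asarray(start, dtype=float)
    current_logp = f(current)
    accepted = 0
    for n in range(1, n_tune + 1):
        candidate = current + rng.normal(scale=step_size, size=current.size)
        candidate_logp = f(candidate)
        if np.log(rng.uniform()) < candidate_logp - current_logp:
            current, current_logp = candidate, candidate_logp
            accepted += 1
        if n % batch == 0:
            rate = accepted / batch
            step_size *= 1.1 if rate > target_accept else 0.9
            accepted = 0
    return step_size
```

Starting from a step size that is far too small (or too large) for the target distribution, the tuner drifts it toward a scale that produces the desired acceptance rate.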
To get correlated parameters we will fit a straight line to data that has not been demeaned. So the model is:
$$y = a x + b + \epsilon$$
or in terms of probabilities
- Likelihood: $P(y| a, x, b, \sigma^2) = N(a x + b, \sigma^2)$
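A log-posterior for this line model might be sketched as follows; flat priors on $a$ and $b$, a known noise variance, and the synthetic data are all assumptions for illustration:

```python
import numpy as np

# synthetic, non-demeaned data: x far from zero makes a and b strongly correlated
rng = np.random.default_rng(3)
x = np.linspace(10.0, 20.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

def log_posterior(theta, sigma2=1.0):
    """Gaussian log-likelihood for y = a*x + b + eps with flat priors on (a, b)."""
    a, b = theta
    residuals = y - (a * x + b)
    return -0.5 * np.sum(residuals ** 2) / sigma2
```

This is exactly the kind of function the generic MCMC algorithm takes as input: it returns the sum of the log-likelihood and log-prior for a given parameter vector.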
- Figure out how many iterations are needed to get an efficient sampling of the posterior
- Define your own forward model (i.e., likelihood & priors). Note that to run MCMC, you would only have to write a function that computes the sum of the log_likelihood and log_prior given some parameters. You can then use the MCMC algorithms above to sample the posterior.
- Test MCMC for a model with (many) more parameters.