### Merge branch 'mcmc' into 'master'

MCMC practical and talk

See merge request !4
parents efa90b68 cda60047
## Markov chain Monte Carlo (MCMC) - Practical

```python
# import some useful libraries
import numpy as np               # numeric library
import matplotlib.pyplot as plt  # main plotting library
import seaborn as sns            # additional plotting library
```

### Bayesian model inversion

In this practical we will do Bayesian model inversion using MCMC for some simple models. Bayesian model inversion assumes we have some forward model mapping parameters to data. We then want to invert this to map some observed data to the parameters. This inversion is done through Bayes' equation (derived in the lecture):

$$P(\vec{\theta}|\vec{x}) = P(\vec{x}|\vec{\theta}) P(\vec{\theta}) / P(\vec{x}),$$

where

- $P(\vec{\theta}|\vec{x})$ is the probability distribution of the parameters $\vec{\theta}$ conditional on the data $\vec{x}$. This is called the posterior and is what we want to derive in Bayesian model inversion.
- $P(\vec{x}|\vec{\theta})$ is the probability distribution of the data $\vec{x}$ conditional on the parameters $\vec{\theta}$. This is called the likelihood. It is defined by our forward model.
- $P(\vec{\theta})$ is the probability distribution of the parameters independent of the data. This is called the prior and should reflect our belief about the parameters before seeing the data. To avoid biasing the analysis, one often uses uninformative (i.e., very broad) prior distributions. Common choices are $P(\vec{\theta}) = 1$ or $P(\vec{\theta}) = 1/\vec{\theta}$.
- $P(\vec{x})$ is the probability of getting the data integrated over all possible parameters. It is called the marginal likelihood or evidence and reflects how likely one is to get the data given the forward model.

As we will see below, the likelihood and prior are usually given by the user based on their forward model for the data.
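For a single parameter, the roles of these four terms can be made concrete by evaluating Bayes' equation on a grid (an illustrative aside; the toy data and variable names here are our own, not part of the practical):

```python
import numpy as np

np.random.seed(0)
x = 1.3 + np.random.randn(3)  # toy data: true mean 1.3, known sigma = 1

theta = np.linspace(-5, 5, 1001)  # grid over the unknown mean
# Gaussian likelihood P(x|theta) evaluated at every grid point
likelihood = np.array([np.prod(np.exp(-(x - m) ** 2 / 2)) for m in theta])
prior = np.ones_like(theta)  # flat prior P(theta) = 1
# marginal likelihood: integral of likelihood * prior over theta
evidence = np.trapz(likelihood * prior, theta)
posterior = likelihood * prior / evidence  # Bayes' equation

print(np.trapz(posterior, theta))  # posterior integrates to 1
```

Dividing by the evidence is exactly what turns the unnormalised product of likelihood and prior into a proper probability distribution; the next section shows why this normalisation is the hard part in general.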
The marginal likelihood can then be computed as:

$$P(\vec{x}) = \int P(\vec{x}, \vec{\theta}) d\vec{\theta} = \int P(\vec{x}|\vec{\theta}) P(\vec{\theta}) d\vec{\theta}$$

So, we can compute the marginal likelihood as the integral of the likelihood multiplied by the prior over all possible parameters. Once we have computed this marginal likelihood (which is just a scalar given the observed data) we can divide the product of likelihood and prior by this number to get the posterior distribution. Unfortunately, this requires integrating over all possible parameters, which is impractical for all but the simplest cases. So, we need an algorithm that allows us to estimate properties of the posterior without having to compute the marginal likelihood, namely MCMC.

### Markov chain Monte Carlo (MCMC)

#### Metropolis-Hastings MCMC algorithm

In MCMC we generate samples from the posterior distribution without explicitly having to know what the posterior distribution looks like. We do this by following a random walk through parameter space using the following algorithm:

- Given the location of the random walk in parameter space $\vec{\theta}_n$ at step $n$, generate some candidate parameters from a proposal distribution $\vec{\theta}_c \sim P_{\rm prop}(\vec{\theta}_c|\vec{\theta}_n)$. A common choice of proposal distribution is a Gaussian centered on $\vec{\theta}_n$.
- Draw a random number between 0 and 1. If this number is smaller than the posterior ratio ($P(\vec{\theta}_c|\vec{x})/P(\vec{\theta}_n|\vec{x})$), accept the candidate point.
This in effect gives an acceptance probability of:

$$P_{\rm accept} = {\rm min}(P(\vec{\theta}_c|\vec{x})/P(\vec{\theta}_n|\vec{x}), 1)$$

- If accepted, the next point is the candidate point: $\vec{\theta}_{n+1} = \vec{\theta}_{c}$
- If rejected, the next point is the original point: $\vec{\theta}_{n+1} = \vec{\theta}_{n}$

Crucially, we can compute the posterior ratio ($P(\vec{\theta}_c|\vec{x})/P(\vec{\theta}_n|\vec{x})$) without computing the marginal likelihood. This is because when we enter Bayes' equation, the marginal likelihood drops out. Instead the posterior ratio is given by the product of the ratio of the likelihoods and the ratio of the priors:

$$\frac{P(\vec{\theta}_c|\vec{x})}{P(\vec{\theta}_n|\vec{x})} = \frac{P(\vec{x}|\vec{\theta}_c)}{P(\vec{x}|\vec{\theta}_n)} \frac{P(\vec{\theta}_c)}{P(\vec{\theta}_n)}$$

Let's start with a simple example, where our data is drawn from a Gaussian distribution with unknown mean, but known standard deviation:

```python
np.random.seed(1)  # set seed to get consistent results
true_mu = 1.3  # unknown
sigma = 1  # known
x = true_mu + np.random.randn(5) * sigma  # draw 5 samples
```

Our likelihood function in this case is:

$$P(\vec{x} | \mu) = N(\mu, \sigma^2=1)$$

We will assume an uninformative prior of $P(\mu) = 1$. So the posterior ratio is given by:

$$\frac{P(\mu_c|\vec{x})}{P(\mu_n|\vec{x})} = \frac{P(\vec{x}|\mu_c)}{P(\vec{x}|\mu_n)} \frac{P(\mu_c)}{P(\mu_n)} = \frac{N(\mu_c, \sigma^2=1)}{N(\mu_n, \sigma^2=1)}$$

In code this looks like:

```python
np.random.seed(2)  # set seed to get consistent results
all_samples = []
current_mu = 0.  # we have to start our random walk somewhere...
for _ in range(10000):  # let's run the MCMC for 10000 steps
    # Draw from Gaussian proposal function
    candidate = np.random.randn() + current_mu
    # Compute posterior ratio
    ratio = (np.prod(np.exp(-(x - candidate) ** 2 / (2 * sigma ** 2)))
             / np.prod(np.exp(-(x - current_mu) ** 2 / (2 * sigma ** 2))))
    if np.random.rand() < ratio:  # accept if random number < P_accept
        current_mu = candidate
    else:  # otherwise reject (i.e., keep the current point)
        current_mu = current_mu
    # always add the new point to the chain, even if the candidate got rejected!
    all_samples.append(current_mu)
```

Let's look at the results:

```python
plt.hist(all_samples, bins=51, label='MCMC samples', density=True)
plt.axvline(true_mu, label='true mean', color='black')
sample_mean = np.mean(x)
sample_mean_std = sigma / np.sqrt(len(x))
xval = np.linspace(-4, 4, 101) * sample_mean_std + sample_mean
plt.plot(xval, 1 / np.sqrt(2 * np.pi * sample_mean_std ** 2)
         * np.exp(-(xval - sample_mean) ** 2 / (2 * sample_mean_std ** 2)),
         label='analytical solution')
plt.legend()
```

Some caveats:

- In practice we will nearly always work with log-probabilities rather than probabilities. So rather than computing the posterior ratio as `ratio = np.prod(np.exp(-(x - candidate) ** 2 / 2)) / np.prod(np.exp(-(x - current_mu) ** 2 / 2))`, we would compute it as `ratio = np.exp(0.5 * (np.sum(-(x - candidate) ** 2) - np.sum(-(x - current_mu) ** 2)))`. This has the advantage that the products in the first form can become extremely small, which can lead to numerical instabilities in the calculations. We will see this in action below.
- The acceptance probability is given by the posterior ratio only for a *symmetric* proposal function. For asymmetric proposal functions (i.e., $P_{\rm prop}(\vec{\theta}_c|\vec{\theta}_n) \ne P_{\rm prop}(\vec{\theta}_n|\vec{\theta}_c)$), the acceptance probability also depends on the ratio of the proposal probabilities (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm).
Here we will only investigate symmetric proposal distributions.

#### Multi-parameter model example

First we define a generic MCMC algorithm. We assume the user provides some function $f$ which gives the sum of the log-likelihood and log-prior given some set of parameters.

```python
def metropolis_hastings(f, x0, nsteps=10000, step_size=1.):
    """MCMC using Metropolis-Hastings algorithm

    Args:
        f: function mapping vector of parameters to sum of log-likelihood and log-prior
        x0: starting set of parameters
        nsteps: how many steps to run the MCMC for
        step_size: size of the Gaussian proposal function

    Returns:
        2D array (nsteps x nparams) with the Markov chain through parameter space
    """
    params = []
    current = x0
    current_f = f(x0)
    for _ in range(nsteps):
        candidate = np.random.randn(len(current)) * step_size + current
        candidate_f = f(candidate)
        ratio = np.exp(candidate_f - current_f)
        if np.random.rand() < ratio:
            current = candidate
            current_f = candidate_f
        params.append(current)
    return np.array(params)
```

This is the full MCMC algorithm. Make sure you understand what is going on!

Let's see this in action for another simple Gaussian example. This time we assume the variance is also unknown:

$$P(\vec{x} | \mu, \sigma^2) = N(\mu, \sigma^2)$$

We will assume uninformative Jeffreys priors of $P(\mu) = 1$ and $P(\sigma^2) = 1/\sigma^2$.

```python
np.random.seed(1)  # set seed to get consistent results
true_mu = 1.3  # unknown
true_std = 1  # unknown
x = true_mu + np.random.randn(5) * true_std  # draw 5 samples
print(x)

def log_likelihood_prior(params):
    mu, variance = params
    if variance < 0:
        # Causes any candidate with negative variance to be rejected.
        # This allows one to set constraints on parameters in MCMC.
        return -np.inf
    # ignoring constant term
    log_likelihood = 0.5 * np.sum(-(x - mu) ** 2 / variance) - 0.5 * len(x) * np.log(variance)
    log_prior = -np.log(variance)
    return log_likelihood + log_prior

samples = metropolis_hastings(log_likelihood_prior, [0, 1])
```

Seaborn allows us to plot the samples from the posterior distribution in 2D (compare with lecture):

```python
as_dict = {
    'mean': samples[:, 0],
    'variance': samples[:, 1],
}
jp = sns.jointplot(x='mean', y='variance', data=as_dict)
# plot sample mean and variance as black lines
jp.ax_joint.axvline(np.mean(x), color='black')
jp.ax_joint.axhline(np.var(x, ddof=1), color='black')
```

We can characterise the posterior distribution by looking at summary measures of our samples:

```python
print(f'P(mu > 0) = {np.sum(samples[:, 0] > 0) / samples.shape[0]}')
print(f'Expectation value of mu (E(mu) = {np.mean(samples[:, 0])}) matches the sample mean ({np.mean(x)})')
```

Here we see that we can estimate the probability of the mean being bigger than zero by the fraction of samples bigger than zero. Similarly, we can estimate the mean of the distribution by computing the mean of the samples. Note that the precision of these estimates is limited by the number of *independent* samples.

### Automatically adjusting MCMC step size

The efficiency of the MCMC algorithm depends greatly on how closely the proposal function matches the target posterior function. The closer they match, the more efficient the algorithm.
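The number of independent samples can be quantified by an effective sample size. The following is a rough sketch of such an estimator (the function and the simple truncation rule are our own; production MCMC libraries use more careful versions):

```python
import numpy as np

def effective_sample_size(chain):
    """Rough effective sample size of a 1D chain.

    Sums the empirical autocorrelations until they first drop below
    zero; a crude version of what MCMC libraries compute.
    """
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    centred = chain - chain.mean()
    acf = np.correlate(centred, centred, mode='full')[n - 1:]
    acf = acf / acf[0]  # normalise so that acf[0] == 1
    tau = 1.0  # integrated autocorrelation time
    for rho in acf[1:]:
        if rho < 0:
            break
        tau += 2 * rho
    return n / tau

np.random.seed(3)
independent = np.random.randn(2000)  # ideal chain: independent draws
sticky = np.zeros(2000)              # strongly autocorrelated chain
for i in range(1, 2000):
    sticky[i] = 0.95 * sticky[i - 1] + np.random.randn()
print(effective_sample_size(independent))  # close to the chain length
print(effective_sample_size(sticky))       # far fewer independent samples
```

An autocorrelated chain of 2000 samples can be worth only a few dozen independent draws, which is why the step size matters so much in the next section.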
This is best visualised using the trace:

```python
# use the same model/data as the previous section
np.random.seed(1)  # set seed to get consistent results
fig, axes = plt.subplots(3, 1, figsize=(4, 10), sharex=True)
for ax, step_size, label in zip(
        axes, (0.05, 1, 10), ('too small', 'decent', 'too large')):
    # run the MCMC with each of three different step sizes
    samples = metropolis_hastings(log_likelihood_prior, [0, 1], step_size=step_size)
    # the trace is simply a line plot showing the evolution of some parameter during the MCMC
    ax.plot(samples[:, 0])
    ax.set_ylabel(r'$\mu$')
    ax.set_title(f'step size is {label}')
axes[-1].set_xlabel('step time (t)')
```

We want subsequent samples to look independent of each other (i.e., middle panel) rather than varying very slowly (indicative of a step size that is too small; top panel) or getting stuck (indicative of a step size that is too large; bottom panel). Note that irrespective of the step size, the distribution of MCMC samples will converge to the posterior distribution. With the wrong step size it just might take much, *much* longer...

A nice way to ensure the step size is correct (and to deal with correlated parameters) is to adjust the step size automatically. To get correlated parameters we will fit a straight line to data that has not been demeaned.
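Besides eyeballing the trace, a quick numerical diagnostic for the step size is the acceptance rate. Since a rejected candidate repeats the previous sample exactly, it can be recovered from a stored chain (the helper below is our own, not part of the practical):

```python
import numpy as np

def acceptance_rate(samples):
    """Fraction of MCMC steps whose candidate was accepted.

    A rejected candidate repeats the previous sample exactly, so we
    count how often consecutive samples differ.
    """
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]  # treat a 1D chain as a single parameter
    changed = np.any(samples[1:] != samples[:-1], axis=1)
    return changed.mean()

# toy chain where steps 1 and 3 were accepted and steps 2 and 4 rejected
chain = np.array([[0.0], [0.7], [0.7], [1.2], [1.2]])
print(acceptance_rate(chain))  # → 0.5
```

A step size that is too small drives the acceptance rate towards 1, while one that is too large drives it towards 0; a common rule of thumb aims for acceptance rates of very roughly 20-50% for low-dimensional problems.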
So the model is:

$$y = a x + b + \epsilon$$

or in terms of probabilities:

- Likelihood: $P(y| a, x, b, \sigma^2) = N(a x + b, \sigma^2)$
- Prior: $P(a, b, \sigma^2) = 1/\sigma^2$

```python
np.random.seed(1)
x = np.random.randn(10) + 6
true_a = 1
true_b = 0
true_variance = 10
y = true_a * x + true_b + np.random.randn(10) * np.sqrt(true_variance)

def logp(params):
    a, b, var = params
    if var < 0:
        return -np.inf
    offset = y - (a * x + b)
    log_likelihood = -0.5 * np.sum(offset ** 2 / var) - 0.5 * len(x) * np.log(var)
    log_prior = -np.log(var)
    return log_likelihood + log_prior

plt.plot(x, y, '.')
plt.xlabel('x')
plt.ylabel('y')
# extend x- and y-axes to include the origin
plt.xlim(0, plt.xlim()[1])
plt.ylim(0, plt.ylim()[1])
```

First we update the Metropolis-Hastings function to accept a variance matrix. The proposal function will be drawn from a Gaussian with this variance.

```python
def metropolis_hastings_cov(f, x0, variance_matrix, nsteps=10000):
    """MCMC using Metropolis-Hastings algorithm

    Args:
        f: function mapping vector of parameters to sum of log-likelihood and log-prior
        x0: starting set of parameters
        variance_matrix: variance to use for the Gaussian proposal function
        nsteps: how many steps to run the MCMC for

    Returns:
        2D array (nsteps x nparams) with the Markov chain through parameter space
    """
    params = []
    # take the square root of the matrix for sampling
    sample_mat = np.linalg.cholesky(variance_matrix)
    current = x0
    current_f = f(x0)
    for _ in range(nsteps):
        candidate = sample_mat @ np.random.randn(len(current)) + current
        candidate_f = f(candidate)
        ratio = np.exp(candidate_f - current_f)
        if np.random.rand() < ratio:
            current = candidate
            current_f = candidate_f
        params.append(current)
    return np.array(params)
```

Then we define a new function, which estimates the best variance for the Gaussian proposal function in an iterative manner:

```python
def iterative_metropolis_hastings(f, x0, nsteps=10000, niter=3):
    """MCMC using Metropolis-Hastings algorithm

    Proposal variance is determined iteratively.

    Args:
        f: function mapping vector of parameters to sum of log-likelihood and log-prior
        x0: starting set of parameters
        nsteps: how many steps to run the MCMC for
        niter: number of iterations to estimate the covariance before running the final MCMC

    Returns:
        2D array (nsteps x nparams) with the Markov chain through parameter space
    """
    # start with a small, isotropic step size
    current_variance_mat = np.eye(len(x0)) * 1e-2
    for _ in range(niter):
        # for each iteration, start by running a short MCMC (i.e., with a reduced number of steps)
        samples = metropolis_hastings_cov(f, x0, current_variance_mat, nsteps=nsteps // 10)
        # set the new variance matrix to the covariance of the samples of this MCMC
        # (divided by the number of parameters)
        current_variance_mat = np.cov(samples.T) / len(x0)
        x0 = samples[-1]  # also update the starting position
    # finally, run the full MCMC with the proposal covariance set by the last iteration
    return metropolis_hastings_cov(f, x0, current_variance_mat, nsteps=nsteps)
```

Without any iterations we can see from the trace that the initial step size is clearly too small:

```python
samples_no_iteration = iterative_metropolis_hastings(logp, [0, 0, 1.], niter=0)
plt.plot(samples_no_iteration[:, 0])
```

However, after several iterations of updating the proposal function covariance matrix, we get a nice trace:

```python
iter_samples = iterative_metropolis_hastings(logp, [0, 0, 1.], niter=3)
plt.plot(iter_samples[:, 0])
```

The resulting distribution of $a$ and $b$ looks like:

```python
as_dict = {
    'a': iter_samples[:, 0],
    'b': iter_samples[:, 1],
    'variance': iter_samples[:, 2],
}
jp = sns.jointplot(x='a', y='b', data=as_dict)
# add the true value as a red star
jp.ax_joint.scatter(true_a, true_b, color='r', marker='*')
```

Ideas for further exploration:

- Figure out how many iterations are needed to get an efficient sampling of the posterior.
- Define your own forward model (i.e., likelihood & priors).
  Note that to run MCMC, you would only have to write a function that computes the sum of the log-likelihood and log-prior given some parameters. You can then use the MCMC algorithms above to sample the posterior.
- Test MCMC for a model with (many) more parameters.
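As a sketch of the second idea, here is a toy forward model for Poisson-distributed counts with an unknown rate (our own example, not part of the practical); writing the log-probability function is all that is required, after which any Metropolis-Hastings loop, such as the ones defined above, can sample it. A minimal loop is included here to keep the example self-contained:

```python
import numpy as np

np.random.seed(4)
counts = np.random.poisson(3.5, size=20)  # toy count data with true rate 3.5

def log_likelihood_prior_poisson(params):
    """Poisson log-likelihood plus a 1/rate prior (constant terms dropped)."""
    rate, = params
    if rate <= 0:
        return -np.inf  # reject non-positive rates
    log_likelihood = np.sum(counts * np.log(rate) - rate)
    log_prior = -np.log(rate)
    return log_likelihood + log_prior

# minimal Metropolis-Hastings loop (same algorithm as in the practical)
current = np.array([1.0])
current_f = log_likelihood_prior_poisson(current)
chain = []
for _ in range(5000):
    candidate = current + 0.3 * np.random.randn(1)
    candidate_f = log_likelihood_prior_poisson(candidate)
    if np.random.rand() < np.exp(candidate_f - current_f):
        current, current_f = candidate, candidate_f
    chain.append(current[0])
print(np.mean(chain[1000:]))  # posterior mean of the rate, close to np.mean(counts)
```

The same pattern (reject invalid parameters with `-np.inf`, drop constant terms, return log-likelihood plus log-prior) carries over to any forward model you define.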
# This is the MCMC (Markov chain Monte Carlo) practical

The practical is in Python. There is a Notebook version (`.ipynb` file) and a Markdown version (`.md`). To use the Notebook version, start a Jupyter notebook and open the file within it. Alternatively, you can simply view the Markdown version and copy-paste the code into your favourite Python shell.