by Marco Tazzari (ESO)
There are many advantages of using Bayesian statistics, and exploring all of them is out of scope of the present tutorial. Bayesian parameter inference provides us with a powerful tool that stems from the notion of probability as degree of belief.
Given the set of parameters \(\vec \theta = (\theta_1,\ \theta_2,\ \dots,\ \theta_N)\) and the observed data \(d\), everything starts with the Bayes Theorem: \[ p(\vec\theta|d)\ =\ \frac{p(d|\vec\theta)\ p(\vec \theta)}{p(d)}\,, \] where:
\(p(\vec\theta|d)\) is the posterior probability distribution function (PDF) for \(\vec \theta\) (degree of belief about the value of \(\vec \theta\) after we see the data);
\(p(d|\vec\theta)\) is the likelihood of the data given the parameters (e.g. Gaussian, Poisson, etc., depending on the nature of the physical process);
\(p(\vec\theta)\) is the prior PDF for \(\vec \theta\) (degree of belief about the value of \(\vec \theta\) before we see the data);
\(p(d)\) is the so-called evidence, i.e. the normalization.
Being able to determine \(p(d)\) allows model selection. Anyway, the unnormalized estimate of \(p(\vec\theta|d)\) is useful for parameters estimate, i.e. for the present tutorial!
Eventually, we can compute the marginalized probability of each parameter, e.g. \(\theta_1\), by integrating the posterior PDF on all the other parameters: \[ p(\theta_1|d) = \int p(\vec \theta|d)\ \mathrm{d} \theta_2\dots \mathrm{d}\theta_N \]
After all, inferring the value of the parameters given our model and the observed data, is very simple and can be accomplished in 3 steps:
Specify the priors;
Estimate the likelihood;
Compute the posterior.
...but, how can we compute the posterior?
Build a series of points in the space of parameters, i.e. a chain, $ (0), (1),, (t), (t+1),, (M) $ such that the position \(\vec\theta(t+1)\) only depends on the position of \(\vec\theta(t)\).
The density of the samples in the chain is directly proportional to the posterior PDF.
The chain converges to a stationary state where successive elements of the chain are samples of the target posterior distribution \(p(\vec\theta|d)\).
This means that, once we have obtained the chain of \(M\) samples, we have everything we need. We can compute:
the marginalized distribution of each parameter \(\theta_i\) by simply approximating it with the histogram of the samples projected into the parameter space spanned by \(\theta_i\) (see below for details);
the expectation value of a function \(f\) of the model parameters: \[ f(\vec\theta)\, =\, \int p(\vec\theta|d)\ f(\vec\theta)\ \mathrm{d}\vec\theta \approx \frac{1}{M}\sum_{t=0}^{M-1} f(\vec\theta(t)) \]
There are many ways of constructing a Markov chain. Basically, at each step \(t\), the generation of the next chain element \(\vec\theta(t+1)\) is probabilistic and defined by the transition probability \(T\left[\vec\theta(t),\ \vec \theta(t+1)\right]\). There are several algorithms to compute this transition probability, e.g Hastings, Metropolis-Hastings, Stretch-move, etc...
The transition probability is, usually, a multivariate Gaussian distribution centered on \(\vec \theta(t)\) with a covariance tensor that needs to be tuned for performance. Main issues:
it is not the most efficient algorithm;
it cannot be run in parallel (it is serial by nature);
it needs the user to determine N(N+1)/2 tuning parameters (the covariance matrix elements)...too much work! Even worse: the behaviour and efficiency highly depends on this choice...
Proposed recently by Goodman & Weare (2010), the idea is to transform highly anisotropic and difficult-to-be-sampled multivariate posterior PDFs into isotropic Gaussians through an affine transformation.
An immediate advantage: the performance of an affine-invariant algorithm is insensitive to covariances among parameters! In other words, an efficient affine-invariant code performs equally well under all linear transformations of the parameters. This is cool!
This algorithm involves an ensemble \(S=\left\{X_k\right\}\) of simultaneously evolving \(K\) walkers, where the transition distribution for each walker is based on the current position of the other \(K-1\) walkers belonging to the complementary ensemble \(S_k=\left\{X_j,\ \forall j\ne k\right\}\). The position of a walker \(X_k(t)\) is updated as follows: \[ X_k(t+1) = X_j + Z(X_k(t)-X_j) \] where \(X_j \in S_k\) and \(Z\) is a random variable drawn from a distribution that does not depend on the covariances between the parameters. The proposal is accepted with probability \[ q = min\left(1, Z^{N-1}\frac{p(X_k(t+1)}{X_k(t)} \right)\,, \] thus there is a non negligible probability (\(1-q\)) to move to a less-probable region.
Hic Sunt Leones: we all now have the big temptation to parallelize this algorithm by simultaneously updating the position of the walkers rather than evolving the walkers in series...argh..this is is wrong since it violates detailed balance (MCMC condition of existence). So...
Following Foreman-Mackey et al. 2013, we can obtain a parallel version of the stretch-move algorithm by:
dividing the full ensemble in two subsets: \[ S_0=\left\{X_k, k=1,\dots,K/2\right\} \\ S_1=\left\{X_k, k=K/2+1,\dots,K\right\} \]
updating simultaneously the walkers in \(S_0\) with the stretch-move algorithm based only on the positions of the walkers in \(S_1\). Then, using the new positions in \(S_0\), update the positions in \(S_1\). In this case, (they ensure us) that the outcome is a valid step for all the walkers.
Note: the performance of this method is comparable to the above serial stretch-move algorithm, but the fact that we can now run it in parallel makes it way more powerful. Smart, isnt' it?
...one last thing:
Several approaches can be adopted to initialize the ensemble of walkers, depending on the initial knowledge of the parameters.
Uniform Distribute the initial walkers over a reasonable range in parameter space. This is the more obvious choice to avoid initial biases to our chain.
Concentrated Start the walkers from a tight N-dimensional ball, near to the point where we surmise the maximum should reside (if we have any evidence of this). This choice is much more effective and helps the chain in avoiding local minima in multi-modal space of parameters landscapes.
The emcee Python package is all we need to perform the parallel version of the Stretch-move algorithm.
The easiest way to install emcee is using pip4. Running the command
% pip install emcee
An alternative installation method is to download the source code from http://dan.iel.fm/emcee and run:
% python setup.py install
in the unzipped directory. Make sure that you have numpy installed in this case as well.
emcee is actively developed on GitHub at http://github.com/dfm/emcee.
Let's try with a simple example to start playing with emcee.
emcee is so simple to use that it is quick to setup your own mock model and do some tests.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size']=20
# Generate synthetic data from a model.
# For simplicity, let us assume a LINEAR model y = m*x + b
# where we want to fit m and b
m_true = -0.9594
b_true = 4.294
N = 50
x = np.sort(10*np.random.rand(N))
yerr = 0.2 + 0.5*np.random.rand(N)
y = m_true*x + b_true
y += yerr * np.random.randn(N)
fig = plt.figure()
fig.set_size_inches(12, 8)
plt.errorbar(x, y, yerr=yerr, fmt='.k')
# Now, let's setup some parameters that define the MCMC
ndim = 2
nwalkers = 500
# Initialize the chain
# Choice 1: chain uniformly distributed in the range of the parameters
pos_min = np.array([-5., 0.])
pos_max = np.array([5., 10.])
psize = pos_max - pos_min
pos = [pos_min + psize*np.random.rand(ndim) for i in range(nwalkers)]
# Visualize the initialization
import triangle
fig = triangle.corner(pos, labels=["$m$", "$b$"], extents=[[-5., 5.], [0., 10.]],
truths=[m_true, b_true])
fig.set_size_inches(10,10)
# Define the posterior PDF
# Reminder: post_pdf(theta, data) = likelihood(data, theta) * prior_pdf(theta)
# We take the logarithm since emcee needs it.
# As prior, we assume an 'uniform' prior (i.e. constant prob. density)
def lnprior(theta):
m, b = theta
if -5.0 < m < 0.5 and 0.0 < b < 10.0:
return 0.0
return -np.inf
# As likelihood, we assume the chi-square. Note: we do not even need to normalize it.
def lnlike(theta, x, y, yerr):
m, b = theta
model = m * x + b
return -0.5*(np.sum( ((y-model)/yerr)**2. ))
def lnprob(theta, x, y, yerr):
lp = lnprior(theta)
if not np.isfinite(lp):
return -np.inf
return lp + lnlike(theta, x, y, yerr)
# Let us setup the emcee Ensemble Sampler
# It is very simple: just one, self-explanatory line
import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=(x, y, yerr))
import time
time0 = time.time()
# burnin phase
pos, prob, state = sampler.run_mcmc(pos, 300)
sampler.reset()
time1=time.time()
print time1-time0
time0 = time.time()
# perform MCMC
pos, prob, state = sampler.run_mcmc(pos, 700)
time1=time.time()
print time1-time0
samples = sampler.flatchain
samples.shape
#let's plot the results
fig = triangle.corner(samples, labels=["$m$", "$b$"], extents=[[-1.1, -0.8], [3.5, 5.]],
truths=[m_true, b_true])
fig.set_size_inches(10,10)
fig = triangle.corner(samples, labels=["$m$", "$b$"], extents=[[-1.1, -0.8], [3.5, 5.]],
truths=[m_true, b_true], quantiles=[0.16, 0.5, 0.84], show_titles=True, labels_args={"fontsize": 40})
fig.set_size_inches(10,10)
sampler.acceptance_fraction
# Plot back the results in the space of data
fig = plt.figure()
xl = np.array([0, 10])
for m, b in samples[np.random.randint(len(samples), size=100)]:
plt.plot(xl, m*xl+b, color="k", alpha=0.1)
plt.plot(xl, m_true*xl+b_true, color="r", lw=2, alpha=0.8)
plt.errorbar(x, y, yerr=yerr, fmt=".k")
fig.set_size_inches(12, 8)
# Plot back the results in the space of data
fig = plt.figure()
xl = np.array([0, 10])
for m, b in samples[-100:]: # samples[np.random.randint(len(samples[-100), size=100)]:
plt.plot(xl, m*xl+b, color="k", alpha=0.1)
plt.plot(xl, m_true*xl+b_true, color="r", lw=2, alpha=0.8)
plt.errorbar(x, y, yerr=yerr, fmt=".k")
fig.set_size_inches(12, 8)
plt.ylim(-6,6)
# Initialize the chain
# Choice 2: chain is initialized in a tight ball around the expected values
pos = [[m_true*1.2,b_true*0.8] + 1e-1*np.random.randn(ndim) for i in range(nwalkers)]
# Visualize the initialization
import triangle
fig = triangle.corner(pos, labels=["$m$", "$b$"], extents=[[-5., 5.], [0., 10.]],
truths=[m_true, b_true])
fig.set_size_inches(10,10)
import time
time0 = time.time()
# burnin phase
pos, prob, state = sampler.run_mcmc(pos, 300)
sampler.reset()
# perform MCMC
pos, prob, state = sampler.run_mcmc(pos, 700)
time1=time.time()
print time1-time0
samples = sampler.flatchain
#let's plot the results
fig = triangle.corner(samples, labels=["$m$", "$b$"], extents=[[-1.1, -0.8], [3.5, 5.]],
truths=[m_true, b_true], quantiles=[0.16, 0.5, 0.84], show_titles=True, labels_args={"fontsize": 40})
fig.set_size_inches(10,10)
# Plot back the results in the space of data
fig = plt.figure()
xl = np.array([0, 10])
for m, b in samples[np.random.randint(len(samples), size=100)]:
plt.plot(xl, m*xl+b, color="k", alpha=0.1)
plt.plot(xl, m_true*xl+b_true, color="r", lw=2, alpha=0.8)
plt.errorbar(x, y, yerr=yerr, fmt=".k")
fig.set_size_inches(12, 8)
You can define the posterior PDF function to return not only the log of the posterior PDF, but also another piece of data (string, list, arrays, more complex objects..) that are stored by emcee and can be retrieved later.
Useful for: e.g. maps or images that take time to be computed and you need later (so, in principle you want to save this time).
Caveat: pay attention to memory usage!
The implementation is simple. Just return the secondary object as follow:
def lnprobfn(p):
return -0.5 * np.sum(p ** 2), str(np.sum(p ** 3))
and then retrieve it after the run_mcmcm() command with:
pos, prob, state, blobs = sampler.run_mcmc(pos, 700)
Very simple, you just need to change one line in the definition of the sampler, adding the keyword threads:
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=(x, y, yerr), threads=n_threads)
where n_threads can be whichever you want, a multiple of the number of walkers to avoid aritificial slowing down of the whole chain.
Caveat: if the computation of the posterior PDF takes very short time, than going with multiprocessing can result in a slow down of the whole fit because it involves pickling the functions. Conversely, if you have a computationally intensive posterior PDF, then multiprocessing will be your greatest friend!.
I have not yet tried, if interested look here
Several methods have been proposed to quantify the performance and the convergence of MCMC, that is actually a nontrivial task. Among them, we will consider the following two: the acceptance fraction and the autocorrelation time.
It is defined as the ratio between the accepted steps over the total number of steps. There is no correct value for the acceptance fraction \(a_f\). However, for sure, are unacceptable:
\(a_f\to0\), that would mean that the chain is not proceeding and is not sampling from the posterior PDF;
\(a_f\to1\), that would mean that the chain is performing a random walk, without effectively sampling the posterior PDF.
Rule of thumb: we should accept \(0.2 \leq a_f \leq 0.5\).
It is a direct measurement of the number of the posterior PDF evaluations needed to produce independent samples of the target density. It has been shown (Goodman & Weare, 2010) that the Stretch-move algorithm has a shorter autocorrelation time, i.e. it needs fewer posterior PDF evaluations than M-H algorithm.
Definition The autocovariance of aa generic time series \(X(t)\) given a time lag \(T\) is: \[ \mathrm{cov}_f(T)=\lim_{t\to\infty} \mathrm{cov}\left[f(X(t+T)),\ f(X(t)) \right]\,, \] where \(f\) is a function of the position \(X\).
The autocorrelation time is then the time lag \(T^*\) such that \(\mathrm{cov}_f(T^*)\to 0\).
Given the MCMC, we can compute the autocovariance with \[ \mathrm{cov}_f(T)\approx \frac{1}{M-T}\sum_{m=1}^{M-T}\left[f(X(T+m))-\langle f \rangle \right]\left[f(X(m))-\langle f \rangle \right] \] Note: The longer the autocorrelation time, the larger the number of the samples we must generate to obtain the desired sampling of the posterior PDF.
tl,dr: The shorter, the better
Acceptance fraction: If the acceptance fraction is getting very low, something is going very wrong
E.g. when the posterior PDF is multi-modal. Possible solution (though quite expensive in term of human time): split the space of parameters, run two (or more) MCMC separately, and then combine them.
Number of walkers: Go large!
If you need more computing power, buy (or ask for) more CPUs!
Immediate advantage: e.g. doubling the number of walkers you obtain twice as many independent samples per autocorrelation time (i.e. for the same number of steps).
Rule of thumb: the number of walker should be given by: \[
\max\left(\min(\textrm{number of walkers such that }a_f\textrm{ during burn-in is good},\textrm{ number of samples you want}\right)
\]
Autocorrelation time:
You should run the sampler for a few (e.g. 10) autocorrelation times. After that, you are almost completely sure to have independent samples from the posterior PDF.