Markov Chain Monte Carlo Sampling

September 23, 2020 #probability#math

One of my courses recently introduced Markov Chain Monte Carlo (MCMC) sampling, which has a lot of applications. I’d like to dive into those applications in a future post, but for now let’s take a quick look at Metropolis-Hastings MCMC.

A Brief Prologue

Let’s say we have a probability distribution function ptarget(x)p_\text{target}(x) (within a mulplicative constant) that is very complex. We have an equation, but maybe it is impossible to integrate. Somehow, we’d like to draw samples from this distribution to estimate things like its average (expected value).

Markov Chains for Sampling

Given a starting state (sample), a Markov chain can tell us the probability of going to each possible next state. So if we have a current state, we can use the Markov chain to tell us the distribution of possible next states.

Hmm. Here’s an idea: what if we built a Markov chain such that the distribution of next states exactly matched our target distribution? Then if we had an existing sample, we could use the chain to quickly find another one!

So what would this ideal Markov chain act like? How do we decide what the “right” transitions are? Well, we want transitions that in the long run will cause the distribution of the next sample to become our target distribution. That is, if we pick any starting sample and run the Markov chain far into the future, we expect that the probability of states at that far horizon time to mirror our target distribution.

Therefore, after we run the Markov chain a bunch of times, the next sample in the sequence will no longer be coupled or highly dependent on the first sample we started with, but will appear to come from the “stationary distribution,” which is our target distribution. The distribution is “stationary” because the Markov chain always ends up there in the long run.

Since we know that the distribution should be stationary, the first step in finding this transition probability is to figure out how to guarantee this. It turns out this is pretty easy. The transition should meet the “detailed balance” requirement, which essential just means that the chance of moving from xx to yy is the same as that of moving from yy to xx.

Specifically, if we have a distribution that we know ptarget(x)p_\text{target}(x), we want to find some easily samplable conditional transition p(xn+1xn)p(x_{n+1} \mid x_n) such that

ptarget(xn+1)p(xnxn+1)=ptarget(xn)p(xn+1xn)(🌴)p_\text{target}(x_{n+1}) p(x_n \mid x_{n+1}) = p_\text{target}(x_n) p(x_{n+1} \mid x_n) \tag{🌴}

This last expression (🌴) says that we want the probability of starting at xnx_n and moving to xn+1x_{n+1} (the right hand side) to equal the probability of starting at xn+1x_{n+1} and moving to xnx_n.

With all this in mind, we can boil things down to a simple question. If I have some sample xnx_n and some xn+1x_{n+1}, how do I decide if xn+1x_{n+1} is a good next sample from the target distribution?

The Metropolis-Hastings Algorithm

This sounds like a product of some Superman villain. That might be better than the math we are about to dive into.

Let’s start off by deciding how to pick our candidate xn+1x_{n+1} given our existing sample xnx_n. We’ll opt to draw it from some distribution pproposal(xn+1xn)p_\text{proposal}(x_{n+1} \mid x_n). In practice, this could end up being a simple normal distribution N(xn,1)\mathcal N (x_n, 1) or whatever you want. We want the chance that we pick and accept this sample to equal the desired transition probability:

p(xn+1xn)=paccept(xn+1,xn)chance we accept×pproposal(xn+1xn)chance we pick(🚀)p(x_{n+1} \mid x_n) = \underbrace{p_\text{accept}(x_{n+1}, x_n)}_{\text{chance we accept}} \times \underbrace{p_\text{proposal}(x_{n+1} \mid x_n)}_{\text{chance we pick}} \tag{🚀}

Remember we want to satisfy (🌴), which we can rearrange to get

ptarget(xn+1)ptarget(xn)=p(xn+1xn)p(xnxn+1)\frac {p_\text{target}(x_{n+1})}{p_\text{target}(x_n)} = \frac {p(x_{n+1} \mid x_n)}{p(x_n \mid x_{n+1})}

A quick substitution using (🚀) yields

ptarget(xn+1)ptarget(xn)=paccept(xn+1,xn)pproposal(xn+1xn)paccept(xn,xn+1)pproposal(xnxn+1)\frac {p_\text{target}(x_{n+1})}{p_\text{target}(x_n)} = \frac {p_\text{accept}(x_{n+1}, x_n) p_\text{proposal}(x_{n+1} \mid x_n)}{p_\text{accept}(x_n, x_{n+1}) p_\text{proposal}(x_n \mid x_{n+1})}

Moving some terms around gives us a required condition for picking an acceptance distribution:

paccept(xn,xn+1)paccept(xn+1,xn)=ptarget(xn)pproposal(xn+1xn)ptarget(xn+1)pproposal(xnxn+1)(🍕)\frac {p_\text{accept}(x_n, x_{n+1})}{p_\text{accept}(x_{n+1}, x_n)} = \frac {p_\text{target}(x_n) p_\text{proposal}(x_{n+1} \mid x_n)}{p_\text{target}(x_{n+1}) p_\text{proposal}(x_n \mid x_{n+1})} \tag{🍕}

So now the trick becomes finding an acceptance distribution that satisfies (🍕). When should we accept the sample from the proposal, given our current sample, and when should we reject it?

🏙 The Metropolis Choice

Let’s try the Metropolis choice:

paccept(xn,xn+1)=min(1,ptarget(xn)pproposal(xn+1xn)ptarget(xn+1)pproposal(xnxn+1))p_\text{accept}(x_n, x_{n+1}) = \min \left(1, \frac {p_\text{target}(x_n) p_\text{proposal}(x_{n+1} \mid x_n)}{p_\text{target}(x_{n+1}) p_\text{proposal}(x_n \mid x_{n+1})}\right)

Notice now that if paccept(xn,xn+1)=<1p_\text{accept}(x_n, x_{n+1}) = ⚾ < 1, then paccept(xn+1,xn)=min(1,1)=1p_\text{accept}(x_{n+1}, x_n) = \min (1, \frac 1 ⚾) = 1 and vice versa. Thus, the Metropolis choice satisfies (🍕)! Feel free to do the algebra on your own if you’re not convinced. It works. Trust me.

This is really useful and important! Why? Because we know all the values needed to calculate ⚾!! We know ptarget()p_\text{target}(\cdot) (within a constant, which disappears due to the fraction), and we know pproposal()p_\text{proposal}(\cdot \mid \cdot) because we picked it ourselves (likely a normal distribution). Once we have paccept(,)p_\text{accept}(\cdot, \cdot), sampling is easy! We can just draw a uniform u[0,1)u \in [0, 1), then check if u<pacceptu < p_\text{accept} (which will happen exactly pacceptp_\text{accept} of the time). If this is true, then xn+1x_{n+1} becomes our new starting sample.

Remember, though, that our new samples will only appear to come from the target distribution in the long run. So we can’t expect our first couple samples to be helpful. Some people talk about burn-in, or the amount of iterations needed before the samples actually start to converge to the target distribution. In other words, we have to give the algorithm time to “find” the target distribution.

Wrapping Up

Whew, that was a whirlwind. Hopefully, sometime soon I’ll be able to revisit this to give a higher level overview of sampling and why it is useful, but for now, this rushed explanation of MCMC and Metropolis-Hastings will have to do.

References

Honestly, the Wikipedia article isn’t that bad compared to most math pages, but I probably skimmed a dozen resources to piece together the MCMC and Metropolis-Hastings derivation.


Written by Matthew Russell who follows Jesus, studies machine learning at the University of Kentucky, and interns at NASA. Get to know him or check out his projects on GitHub.