One of my courses recently introduced Markov Chain Monte Carlo (MCMC) sampling, which has a lot of applications. I’d like to dive into those applications in a future post, but for now let’s take a quick look at Metropolis-Hastings MCMC.
A Brief Prologue
Let’s say we have a probability distribution function (known only up to a multiplicative constant) that is very complex. We have an equation, but maybe it is impossible to integrate. Somehow, we’d like to draw samples from this distribution to estimate things like its average (expected value).
Markov Chains for Sampling
Given a starting state (sample), a Markov chain can tell us the probability of going to each possible next state. So if we have a current state, we can use the Markov chain to tell us the distribution of possible next states.
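To make that concrete, here is a minimal sketch of a tiny Markov chain. The three states and the `TRANSITION` matrix are made up purely for illustration; each row is the distribution of next states for one current state.

```python
import random

# Hypothetical 3-state chain (states 0, 1, 2), made up for illustration.
# Row i holds the probabilities of moving from state i to states 0, 1, 2.
TRANSITION = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

def next_state_distribution(current_state):
    """Return the distribution over next states, given the current state."""
    return TRANSITION[current_state]

def sample_next_state(current_state, rng):
    """Draw one next state according to that distribution."""
    weights = next_state_distribution(current_state)
    return rng.choices(range(len(weights)), weights=weights)[0]

rng = random.Random(0)
print(next_state_distribution(0))
print([sample_next_state(0, rng) for _ in range(10)])
```

From state 0 the chain can only stay put or move to state 1, so the sampled next states never include state 2.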
Hmm. Here’s an idea: what if we built a Markov chain such that the distribution of next states exactly matched our target distribution? Then if we had an existing sample, we could use the chain to quickly find another one!
So how would this ideal Markov chain behave? How do we decide what the “right” transitions are? Well, we want transitions that in the long run will cause the distribution of the next sample to become our target distribution. That is, if we pick any starting sample and run the Markov chain far into the future, we expect the probability of each state at that far horizon to mirror our target distribution.
Therefore, after we run the Markov chain a bunch of times, the next sample in the sequence will no longer be coupled to or highly dependent on the first sample we started with, but will appear to come from the “stationary distribution,” which is our target distribution. The distribution is “stationary” because once the chain’s states follow it, taking another step leaves it unchanged, and the Markov chain always ends up there in the long run.
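Here is a small sketch of that “forgetting the start” behavior. The 3-state chain is a hypothetical one picked for illustration, and `run_chain` is just a helper name I made up; it pushes a whole distribution through the chain step by step.

```python
# Hypothetical 3-state chain, made up for illustration.
TRANSITION = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

def step(distribution):
    """Push a distribution over states one step through the chain."""
    n = len(TRANSITION)
    return [
        sum(distribution[i] * TRANSITION[i][j] for i in range(n))
        for j in range(n)
    ]

def run_chain(start, n_steps):
    """Apply `step` repeatedly, starting from the distribution `start`."""
    distribution = start
    for _ in range(n_steps):
        distribution = step(distribution)
    return distribution

# Two very different starting points: certainly state 0 vs. certainly state 2.
from_state_0 = run_chain([1.0, 0.0, 0.0], 50)
from_state_2 = run_chain([0.0, 0.0, 1.0], 50)
print(from_state_0)
print(from_state_2)
```

Both runs end up at (numerically) the same distribution, roughly 0.25, 0.5, 0.25, no matter where they started. That shared limit is this toy chain's stationary distribution.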
Since we know that the target distribution should be stationary, the first step in finding the transition probability is to figure out how to guarantee this. It turns out this is pretty easy. The transition should meet the “detailed balance” requirement, which essentially just means that the chance of being at $x$ and moving to $x'$ is the same as that of being at $x'$ and moving to $x$.
Specifically, if we have a distribution $p(x)$ that we know (up to a constant), we want to find some easily samplable conditional transition $T(x' \mid x)$ such that

$$p(x')\,T(x \mid x') = p(x)\,T(x' \mid x) \tag{🌴}$$

This last expression (🌴) says that we want the probability of starting at $x$ and moving to $x'$ (the right-hand side) to equal the probability of starting at $x'$ and moving to $x$ (the left-hand side).
With all this in mind, we can boil things down to a simple question. If I have some sample $x$ and some candidate $x'$, how do I decide if $x'$ is a good next sample from the target distribution?
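As a sanity check on detailed balance, here is a rough sketch on a toy problem. The three-state `TARGET` and the `TRANSITION` matrix are hypothetical values I picked so that the condition holds; the code checks the condition pair by pair and then confirms that one step of the chain leaves the target unchanged.

```python
# Hypothetical target over three states and a hand-picked transition matrix.
TARGET = [0.25, 0.50, 0.25]
TRANSITION = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

def satisfies_detailed_balance(target, transition, tol=1e-12):
    """Check target[i] * T[i][j] == target[j] * T[j][i] for every pair."""
    n = len(target)
    return all(
        abs(target[i] * transition[i][j] - target[j] * transition[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )

def step(distribution, transition):
    """Push a distribution over states one step through the chain."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]

print(satisfies_detailed_balance(TARGET, TRANSITION))
print(step(TARGET, TRANSITION))
```

The second print shows the target coming back out unchanged, which is the whole point: detailed balance is enough to make the target stationary.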
The Metropolis-Hastings Algorithm
This sounds like a product of some Superman villain. That might be better than the math we are about to dive into.
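Let’s start off by deciding how to pick our candidate $x'$ given our existing sample $x$. We’ll opt to draw it from some proposal distribution $q(x' \mid x)$. In practice, this could end up being a simple normal distribution or whatever you want. We then accept the candidate with some probability $A(x' \mid x)$. We want the chance that we pick and accept this sample to equal the desired transition probability:

$$T(x' \mid x) = q(x' \mid x)\,A(x' \mid x) \tag{🚀}$$

Remember we want to satisfy (🌴), which we can rearrange to get

$$\frac{T(x' \mid x)}{T(x \mid x')} = \frac{p(x')}{p(x)}$$

A quick substitution using (🚀) yields

$$\frac{q(x' \mid x)\,A(x' \mid x)}{q(x \mid x')\,A(x \mid x')} = \frac{p(x')}{p(x)}$$

Moving some terms around gives us a required condition for picking an acceptance distribution:

$$\frac{A(x' \mid x)}{A(x \mid x')} = \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)} \tag{🍕}$$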
So now the trick becomes finding an acceptance distribution that satisfies (🍕). When should we accept the sample from the proposal, given our current sample, and when should we reject it?
🏙 The Metropolis Choice
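Let’s try the Metropolis choice:

$$A(x' \mid x) = \min\left(1,\ \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right) \tag{⚾}$$

Notice now that if $A(x' \mid x) < 1$, then $A(x \mid x') = 1$ and vice versa. Thus, the Metropolis choice satisfies (🍕)! Feel free to do the algebra on your own if you’re not convinced. It works. Trust me.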
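For anyone who does want to see that algebra, here is a sketch. The notation is assumed for this aside: $p$ is the target, $q(x' \mid x)$ is the proposal, $A(x' \mid x)$ is the acceptance probability, and $r$ is shorthand for the ratio inside the Metropolis choice. Swapping $x$ and $x'$ turns $r$ into $1/r$.

$$r = \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}, \qquad A(x' \mid x) = \min(1, r), \qquad A(x \mid x') = \min\left(1, \tfrac{1}{r}\right)$$

$$\frac{A(x' \mid x)}{A(x \mid x')} = \begin{cases} \dfrac{r}{1} = r & \text{if } r < 1 \\[2ex] \dfrac{1}{1/r} = r & \text{if } r \ge 1 \end{cases}$$

Either way the ratio of acceptance probabilities comes out to $r$, which is exactly what (🍕) asks for.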
This is really useful and important! Why? Because we know all the values needed to calculate ⚾!! We know $p$ (within a constant, which disappears due to the fraction), and we know $q$ because we picked it ourselves (likely a normal distribution). Once we have $A(x' \mid x)$, sampling is easy! We can just draw a uniform $u \sim \mathcal{U}(0, 1)$, then check if $u < A(x' \mid x)$ (which will happen exactly $A(x' \mid x)$ of the time). If this is true, then $x'$ becomes our new starting sample. If not, we stay at $x$ and count it again as the next sample.
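Putting the pieces together, here is a minimal sketch of the whole loop in plain Python. Everything specific in it is an assumption for the sake of the example: the target is a normal shape centered at 3 with its constant dropped, the proposal is a normal centered at the current sample, and the function names are my own.

```python
import math
import random

def target_unnormalized(x):
    """Hypothetical target: a normal shape centered at 3, constant dropped."""
    return math.exp(-0.5 * (x - 3.0) ** 2)

def proposal_sample(x, rng, step=1.0):
    """Draw a candidate from a normal proposal centered at the current sample."""
    return rng.gauss(x, step)

def proposal_density(x_new, x_old, step=1.0):
    """Proposal density of x_new given x_old (its constant cancels in the ratio)."""
    return math.exp(-0.5 * ((x_new - x_old) / step) ** 2)

def metropolis_hastings(n_samples, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = proposal_sample(x, rng)
        # Ratio from the Metropolis choice: target ratio times reversed proposal ratio.
        ratio = (target_unnormalized(x_new) * proposal_density(x, x_new)) / (
            target_unnormalized(x) * proposal_density(x_new, x)
        )
        accept_prob = min(1.0, ratio)
        # Accept with probability accept_prob; otherwise keep the current sample.
        if rng.random() < accept_prob:
            x = x_new
        samples.append(x)
    return samples

samples = metropolis_hastings(20000)
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

The sample mean should land close to 3, the center of the assumed target. Because this particular proposal is symmetric, the two proposal densities cancel; I left them in so the code matches the general formula. For targets with very small density values, working with logs would be the safer choice.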
Remember, though, that our new samples will only appear to come from the target distribution in the long run. So we can’t expect our first couple of samples to be helpful. Some people talk about burn-in, or the number of iterations needed before the samples actually start to converge to the target distribution. In other words, we have to give the algorithm time to “find” the target distribution.
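To see burn-in in action, here is a rough sketch that deliberately starts the chain far from where the target lives. The target (a normal shape centered at 3), the starting point of 100, and the `BURN_IN` cut-off of 500 are all assumptions chosen for illustration. The proposal is symmetric, so it cancels out of the acceptance ratio, and I work with logs so tiny densities don't underflow.

```python
import math
import random

def log_target(x):
    """Log of a hypothetical unnormalized target: a normal shape centered at 3."""
    return -0.5 * (x - 3.0) ** 2

def random_walk_metropolis(n_samples, x0, seed=0, step=1.0):
    """Metropolis sampler with a symmetric normal proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = rng.gauss(x, step)
        log_ratio = log_target(x_new) - log_target(x)
        # Accept for sure if the candidate is at least as likely,
        # otherwise accept with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = x_new
        samples.append(x)
    return samples

def mean(values):
    return sum(values) / len(values)

BURN_IN = 500  # assumed cut-off, picked by eye for this toy problem

samples = random_walk_metropolis(2000, x0=100.0)
mean_all = mean(samples)
mean_after_burn_in = mean(samples[BURN_IN:])
print(mean_all, mean_after_burn_in)
```

The mean over all samples gets dragged toward the bad starting point, while the mean after dropping the first `BURN_IN` samples sits near 3. How many samples to throw away is a judgment call; 500 just happens to be plenty here.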
Whew, that was a whirlwind. Hopefully, sometime soon I’ll be able to revisit this to give a higher level overview of sampling and why it is useful, but for now, this rushed explanation of MCMC and Metropolis-Hastings will have to do.
Honestly, the Wikipedia article isn’t that bad compared to most math pages, but I probably skimmed a dozen resources to piece together the MCMC and Metropolis-Hastings derivation.