Motivating MCMC

Maybe it is my communications background, but it took me a long time to really “understand” why we need MCMC. Consider a really simple inference problem: recovering a state x from data D. In Bayesian inference, we try to compute the posterior distribution

p(x|D) = \frac{p(D|x) p(x)}{\int p(D|x) p(x) dx}

Most textbooks simply explain that the denominator is hard to compute and that this is why we need MCMC. But wait a minute: the denominator is just p(D). It is independent of the thing we need to estimate, namely x. If we are satisfied with a MAP estimate, then solving

\max_x p(x|D) = \max_x \frac{p(D|x) p(x)}{\int p(D|x) p(x) dx} is the same as

\max_x p(D|x) p(x)

So why do we bother with the denominator?

Update: Yes, indeed. For MAP alone, I think we don’t need to bother with it.

In most communications (decoding) problems, the state is usually discrete or quite low-dimensional, so we can realistically consider all (or most) combinations of x. After all, solving \max_x p(D|x) p(x) means that we need to consider every possible x.
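As a toy illustration of this brute-force MAP search, here is a sketch with hypothetical prior and likelihood values over a two-state x; note the denominator p(D) never appears:

```python
# Toy discrete MAP: enumerate all states, feasible when x is low-dimensional.
# The prior and likelihood numbers below are made up for illustration.
prior = {0: 0.6, 1: 0.4}        # p(x)
likelihood = {0: 0.2, 1: 0.9}   # p(D|x) for the observed data D

# argmax_x p(D|x) p(x) -- no normalizing constant needed.
x_map = max(prior, key=lambda x: likelihood[x] * prior[x])
# x_map == 1, since 0.9 * 0.4 = 0.36 > 0.2 * 0.6 = 0.12
```

Dividing every product by p(D) would rescale all candidates equally, so the argmax is unchanged.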

For very high-dimensional x, solving the MAP problem above can be as hard as, or even harder than, computing p(x|D) directly. And for the latter, there is something not obvious that is rarely stated explicitly in introductory textbooks: the problem of computing p(x|D) is not just evaluating the probability itself, but deciding which values of x are even worth evaluating. That is why we need MCMC. Basically, we perturb in the x space so that we cover most of the high-probability values of x (i.e., a random walk in x space guided by the posterior). So a key issue in MCMC is to design a transition procedure/Markov chain (state transition model) whose stationary distribution converges to p(x|D).
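The perturb-and-accept idea above can be sketched as random-walk Metropolis, one of the simplest MCMC algorithms. This is a minimal illustration, not production code; the toy target (a standard normal, i.e., log p(D|x)p(x) ∝ -x²/2) is an assumption chosen so the sampler only ever sees the unnormalized posterior — the intractable denominator never appears:

```python
import math
import random

def metropolis(log_unnorm_post, x0, step=0.5, n_samples=5000, seed=0):
    """Random-walk Metropolis: samples from a distribution known only up to
    its normalizing constant, via the log of the unnormalized density."""
    rng = random.Random(seed)
    x, lp = x0, log_unnorm_post(x0)
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)          # perturb in x space
        lp_new = log_unnorm_post(x_new)
        # Accept with probability min(1, ratio of unnormalized posteriors);
        # p(D) cancels in the ratio, which is the whole point.
        if math.log(rng.random()) < lp_new - lp:
            x, lp = x_new, lp_new
        samples.append(x)                          # rejected moves repeat x
    return samples

# Toy target: unnormalized standard normal.
samples = metropolis(lambda x: -0.5 * x * x, x0=3.0)
mean = sum(samples) / len(samples)   # should wander toward 0
```

The chain starts at x0 = 3.0, well away from the mode, and the random walk drifts into and then explores the high-probability region, which is exactly the "which x are worth computing" problem the paragraph describes.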
