Motivating MCMC

Maybe it is my communications background, but it took me a long time to really “understand” why we need MCMC. Consider a really simple inference problem: recovering a state x from data D. In Bayesian inference, we try to compute the posterior distribution

p(x|D) = \frac{p(D|x) p(x)}{\int p(D|x) p(x) dx}

Most textbooks simply explain that the denominator is hard to compute and that this is why we need MCMC. But wait a minute: the denominator is just p(D). It is independent of the thing we need to estimate, namely x. If we are satisfied with a MAP estimate, then solving

\max_x p(x|D) = \max_x \frac{p(D|x) p(x)}{\int p(D|x) p(x) dx} is the same as

\max_x p(D|x) p(x)

So why do we bother with the denominator?

Update: Yes, indeed. For MAP alone, I think we don’t need to bother with it.

In most communications (decoding) problems, the state is usually discrete or quite low-dimensional, so we can realistically consider all (or most) combinations of x. After all, solving \max_x p(D|x) p(x) means that we need to consider every possible x.
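As a toy illustration of this brute-force MAP search, here is a sketch with hypothetical prior and likelihood values over a two-state x; note the denominator p(D) never appears:

```python
# Toy discrete MAP: enumerate all states, feasible when x is low-dimensional.
# The prior and likelihood numbers below are made up for illustration.
prior = {0: 0.6, 1: 0.4}        # p(x)
likelihood = {0: 0.2, 1: 0.9}   # p(D|x) for the observed data D

# argmax_x p(D|x) p(x) -- no normalizing constant needed.
x_map = max(prior, key=lambda x: likelihood[x] * prior[x])
# x_map == 1, since 0.9 * 0.4 = 0.36 > 0.2 * 0.6 = 0.12
```

Dividing every product by p(D) would rescale all candidates equally, so the argmax is unchanged.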

For very high-dimensional x, solving the MAP problem above can be as hard as, or even harder than, computing p(x|D) directly. And for the latter, there is something not obvious that is rarely stated explicitly in introductory textbooks: the problem of computing p(x|D) is not just evaluating the probability itself, but deciding which values of x are even worth evaluating. That is why we need MCMC. Basically, we perturb in the x space so that we cover most of the high-probability values of x (i.e., a random walk in x space guided by the posterior). So a key issue in MCMC is to design a transition procedure/Markov chain (state transition model) whose stationary distribution converges to p(x|D).
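The perturb-and-accept idea above can be sketched as random-walk Metropolis, one of the simplest MCMC algorithms. This is a minimal illustration, not production code; the toy target (a standard normal, i.e., log p(D|x)p(x) ∝ -x²/2) is an assumption chosen so the sampler only ever sees the unnormalized posterior — the intractable denominator never appears:

```python
import math
import random

def metropolis(log_unnorm_post, x0, step=0.5, n_samples=5000, seed=0):
    """Random-walk Metropolis: samples from a distribution known only up to
    its normalizing constant, via the log of the unnormalized density."""
    rng = random.Random(seed)
    x, lp = x0, log_unnorm_post(x0)
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)          # perturb in x space
        lp_new = log_unnorm_post(x_new)
        # Accept with probability min(1, ratio of unnormalized posteriors);
        # p(D) cancels in the ratio, which is the whole point.
        if math.log(rng.random()) < lp_new - lp:
            x, lp = x_new, lp_new
        samples.append(x)                          # rejected moves repeat x
    return samples

# Toy target: unnormalized standard normal.
samples = metropolis(lambda x: -0.5 * x * x, x0=3.0)
mean = sum(samples) / len(samples)   # should wander toward 0
```

The chain starts at x0 = 3.0, well away from the mode, and the random walk drifts into and then explores the high-probability region, which is exactly the "which x are worth computing" problem the paragraph describes.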
