JF: The idea of MCMC methods is to wander randomly in a space of trees in such a way as to settle down into an equilibrium distsribution of trees that has the desired distribution (in this case, the Bayesian posterior).
TE: OK
JF: Imagine all possible trees, connected to each other so as to form a connected graph.
TE: OK
JF: Suppose we know the distribution of trees f(T) that we want.
TE: Hunnh?
So, putting off for a moment the search for just the right explanation that will yield enlightenment, let's return to the simulation from yesterday. We generate a series of 9 models named 'A' through 'I' with assigned likelihoods 0.1, 0.2, .. 0.9. These values were assigned randomly (for no good reason) and it happens that they are in this order: BCFHEDGIA (B = 0.1, A = 0.9).
Next we construct a Markov chain---a chain whose next step depends only on the current step and not on the previous path, according to a special rule. We wander along this chain, and at the end, we compute the frequency at which we saw the various models. The relative frequencies in the Markov chain will be proportional to their likelihoods.
In fact, the first simulation that I ran just generated the next model randomly, independent of the current position. We compared the likelihood q of the proposal to the likelihood p of the current model. If q > p, we move. If q < p, we still move but only with probability q/p, otherwise we re-sample the current model. The ratio q/p is called the acceptance ratio.
This gives the correct distribution even though it's not really MCMC. It's like throwing darts at the unit square to evaluate the area under the curve y = x2 (here). Where it fails is in searching a "tree and parameter space" that is high-dimensional and mostly very low likelihood. Then the random method cannot get enough good samples because it's too inefficient.
The second method, which is shown in the code that I posted, uses what I think is a particular example of the Metropolis version of MCMC. To make the next move, we choose a proposal whose name is adjacent in the alphabet. From 'C' we choose either 'B' or 'D', from 'I' we choose 'H' or 'A', wrapping around. We calculate the acceptance ratio and act accordingly, as above. The demo showed that this method works, it generates a sample in which the models appear in proportion to their likelihoods. It may converge on the correct distribution faster than the random method, I'm not sure yet.
The critical feature of the Metropolis version is called 'detailed balance', which means that the probability of moving from model i to model j is equal to the reverse:
P(Mj|Mi) = P(Mi|Mj)
The Hastings modification allows us to use proposal distributions in which detailed balance doesn't hold. In the modified version of the simulation (code below), I generated a list where the models are present at different frequencies in order ABCDEFGHI. Now, for all i and j:
P(Mj|Mi) ≠ P(Mi|Mj)
for example:
P(MA|MD) = 4 x P(MD|MA)
But MCMC still works if we apply a correction to the acceptance ratio. I compute what I call the "bias", the ratio between the frequencies of the proposal and the current model in the list of models from which proposals are generated. if the acceptance ratio is divided by the bias, then we get the right distribution.
Wigner has a 1960 paper called "The Unreasonable Effectiveness of Mathematics in the Natural Sciences." (wikipedia) It seems to me that the thing we need to understand about MCMC is not how it gets the right distribution of models according to their likelihoods, but why it is successful in finding peaks of probability density in a multi-dimensional space that is not unlike the galaxy---mostly very low density. I'm betting that even MCMC wouldn't work well in this case; I think there has to be some trail of probability density to follow in order to find the peaks.