Let's go back to the beginning, and start from the standard derivation of Bayes rule. The joint probability of two events A and B can be written in two ways:
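P(A,B) = P(A|B) P(B) = P(B|A) P(A)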
Now, what we really want to talk about is not just any two events A and B, but data and models. In the figure, we sketch out an event space of all possible outcomes before seeing any data, which is divided into rectangles of different sizes (10 horizontal steps and 20 vertical steps). Each horizontal step is the same size (1/10), as is each vertical step (1/20).
Let's say we partition the event space using horizontal steps and assign each partition to a model: A through J.
Consider the purple rectangle at the extreme lower right. In this part of the event space, data z was observed and model J is correct.
The size of this rectangle---the probability P(z,J)---is one-half the size of the purple rectangle to its left, which is labeled 1 because its area is 1% of the total. The area of the extreme lower-right rectangle is P(z,J) = 1/20 x 1/10 = 1/200 = 0.005 (i.e. 0.5%). In symbols, this calculation is:
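P(z|J) x P(J) = P(z,J)
1/20 x 1/10 = 1/200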
The probability of data z, given that model J is correct, times the probability that model J is correct, is equal to the probability of both events: that we will observe the data z and that model J is correct.
Now, suppose we back up and think about all the models. We observe data z and ask, what is P(J|z)? That is, rather than asking how often model J will generate data z, we know the data already, and we want to learn the probability that model J is correct.
Using Bayes rule and rearranging terms we just do this:
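P(J|z) = P(z|J) P(J) / P(z)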
Having observed data z, we can compute P(J|z). The three terms we need to do the calculation are: the likelihood P(z|J), the prior P(J), and the total probability of the data P(z), which is the sum of P(z|M) P(M) over all the models.
What's nice about this figure is that it's easy to see the influence of our prior evaluation of P(J). In this case, we've taken an agnostic (uninformative) prior and assumed that each model is equally probable (the horizontal width of each rectangle is the same). We could accommodate prior information by adjusting these widths.
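The arithmetic of the figure can be sketched in a few lines. Only the likelihoods for models I and J (0.10 and 0.05, which follow from the two purple rectangles described above) come from the text; the values for the other models are hypothetical, chosen just for illustration:

```python
# Posterior probabilities P(M|z) for ten models with a uniform prior.
models = list("ABCDEFGHIJ")
prior = {m: 0.1 for m in models}        # agnostic prior: equal widths

likelihood = {m: 0.02 for m in models}  # hypothetical P(z|M) for A..H
likelihood["I"] = 0.10                  # P(z,I) = 0.01,  so P(z|I) = 0.10
likelihood["J"] = 0.05                  # P(z,J) = 0.005, so P(z|J) = 0.05

# P(z) is the sum of P(z|M) P(M) over all the models
p_z = sum(likelihood[m] * prior[m] for m in models)
posterior = {m: likelihood[m] * prior[m] / p_z for m in models}

print(round(posterior["J"], 3))
```

Note that the posteriors P(M|z) sum to one by construction, because we divide by P(z).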
For any well-specified model we should be able to calculate P(data|model). But by the time we're doing the calculation, we've already seen the data, so it seems contrived to talk about its probability. A new term is used, the likelihood, and we talk about the likelihood L of model J given what we just observed (data z):
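L(J|z) = P(z|J)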
But what Higgs and Attwood say is different:
... the likelihood of the data according to each of the models. This is written L(D|Mk) and is read as "the likelihood of the data, given model k." These likelihoods are defined as the probability distributions over all possible sets of data that could be described by the model. If we sum the likelihoods over all possible data for any given model, we must get one (i.e. the likelihood is normalized).
They are saying that something must have happened: within the slice of the event space where model J is true, the probabilities of all possible data must be normalized, so that
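Σ P(z|J) = 1   (summing over all possible data z)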
They are summing vertically inside the box that is model J.
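As a concrete stand-in for model J, take a binomial model (this specific example is ours, not Higgs and Attwood's; the value p = 0.4 is an arbitrary choice). For a fixed parameter, summing the probability over every possible outcome gives one:

```python
from math import comb

# Fixed model: coin with P(heads) = p, tossed 11 times (p = 0.4 is arbitrary)
p = 0.4
# Sum P(k heads | p) over every possible outcome k = 0..11:
total = sum(comb(11, k) * p**k * (1 - p)**(11 - k) for k in range(12))
print(total)  # 1.0, up to floating point
```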
In contrast, Felsenstein uses the phrase "Likelihood Prob(D|H)", and gives the example of estimating the value of parameter p for a binomial distribution, given data of 5 heads and 6 tails. Given that the model parameter p is any particular value, we can calculate P(data|p). The maximum likelihood estimate of p works out to 5/11.
Note that although this looks rather like a distribution, it is not. It plots the probabilities of the same data D for different values of p. Thus it does not show the probabilities of different mutually exclusive outcomes, and the area under the curve need not be 1.
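Felsenstein's example is easy to check numerically. A minimal sketch (the grid and step size are arbitrary choices): the likelihood peaks near p = 5/11, and the area under the curve comes out to about 1/12, not 1.

```python
from math import comb

def lik(p, heads=5, tails=6):
    # P(data | p): probability of 5 heads and 6 tails in 11 tosses
    return comb(heads + tails, heads) * p**heads * (1 - p)**tails

ps = [i / 1000 for i in range(1, 1000)]  # grid of p values in (0, 1)
best = max(ps, key=lik)                  # maximum likelihood estimate of p
area = sum(lik(p) for p in ps) / 1000    # area under the likelihood curve

print(best)  # close to 5/11 = 0.4545...
print(area)  # about 0.083 -- not 1
```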
The equivalent observation in our case is that the bar-graph of purple rectangles in the figure shows the probabilities of the same data z for different models. These are not mutually exclusive outcomes (we could have obtained other data), and the total area of the bars need not be 1.
I'm with Felsenstein. Talk about the "likelihood of the model having observed some data", and don't assume that the likelihoods of different models given the same data should sum to 1. They don't.