# Bayesian model selection for beginners

I’ve recently encountered the following problem (or at least, a problem that can be formulated like this, the actual thing didn’t involve coins):

Imagine you’ve been provided with the outcomes of two sets of coin tosses.

First set: $N_1$ tosses, $n_1$ heads

Second set: $N_2$ tosses, $n_2$ heads

And we want to know whether the same coin was used in both cases.  Note that these ‘coins’ might be very biased (always heads, always tails), we don’t know.

There are classical tests for doing this, but I wanted to do a Bayesian test.  This in itself is not much of a challenge (it’s been done many times before) but what I think is interesting is how to explain the procedure to someone who has no familiarity with the Bayesian approach.  Maybe if that is possible, more people would do it.  And that would be a good thing.  Here’s what I’ve come up with:

We’ll first assume a slight simplification of the problem.  We’ll assume that there are 9 coins in a bag and the person who did the coin tossing either:

Pulled one coin out of the bag and did $N_1$ tosses followed by $N_2$ tosses

OR

Pulled one coin out of the bag, did $N_1$ tosses, put the coin back, pulled out another coin (could have been the same one) and did $N_2$ tosses with this one.

The coins in the bag all have different probabilities of landing heads.  The probabilities for the 9 coins are: 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9.  We’ll label these coins $c_1$ to $c_9$.

Model 1

We’ll start by building a model for each scenario.  The first model assumes that both sets of coin toss data were produced by the same coin.  If we knew which coin it was, the probability of seeing the data we’ve seen would be:

$Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)$

i.e. the probability of seeing $n_1$ heads in the first $N_1$ tosses of coin $c_n$ multiplied by the probability of seeing $n_2$ heads in the second $N_2$ tosses of coin $c_2$.

In reality we don’t know which coin it was. In this case, it seems sensible to calculate the averageprobability. The total of all of the probabilities divided by the number of coins. This is computed as:

$p_1=\sum_{n=1}^9 \frac{1}{9} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)$

Another way of thinking about this is as a weighted sum of the probabilities where each weight is the probability of that coin being chosen. In this case, each coin was equally likely and so has probability 1/9. In general this doesn’t have to be the case.

Note that we haven’t defined what $Pr(n_1|N_1,c_n)$ is. This isn’t particularly important – it’s just a number that tells us how likely we are to get a particular number of heads with a particular coin.

Model 2
The second model corresponds to picking a coin randomly for each set of coin tosses. The probability of the data in this model, if we knew that the coins were $c_n$ and $c_m$ is:

$Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)$

Again, we don’t know $c_n$ or $c_m$ so we average over them both. In other words, we average over every combination of coin pairs. There are 81 such pairs and each pair is equally likely:

$p_2 = \sum_{n=1}^9 \sum_{m=1}^9 \frac{1}{81} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)$

Comparing the models

We can now compare the numbers $p_1$ and $p_2$. If one is much higher than the other, that model is most likely. Lets try an example…

Example 1
Here’s the data:

$N_1=10,n_1=9,N_2=10,n_2=1$

So, 9 heads in the first set of tosses and only 1 in the second. At first glance, it looks like a different coin was used in both cases. Lets see if our calculations agree. Firstly, here is $Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)$ plotted on a graph where each bar is a different $c_n$.

The coin that looks most likely is the one with $c_n=0.5$. But, look at the scale of the y-axis. All of the values are very low – none of them look like they could have created this data. It’s hard to imagine taking any coin, throwing 9 heads in the first ten and then 1 head in the second ten.

To compute $p_1$ we add up the total height of all of the bars and divide the whole thing by 9: $p_1=0.000029$.

We can make a similar plot for model 2:

The plot is now 3D, because we need to look at each combination of $c_n$ and $c_m$. The height of each bar is:

$Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)$

The bars are only high when $c_n$ is high and $c_m$ is low. But look at the height of the high bars – much much higher than those in the previous plot. Some combinations of two coins could conceivable have generated this data. To compute $p_2$ we need to add up the height of all the bars and divide the whole thing by 81: $p_2 = 0.008478$

This is a small number, but almost 300 times bigger than $p_1$. The model selection procedure has backed up our initial hunch.

Example 2

$N_1=10,n_1=6,N_2=10,n_2=5$

i.e. 6 heads in the first 10, 5 heads in the second 10. This time it looks a bit like it might be the same coin. Here’s the plot for model 1:

And here’s the magic number: $p_1 = 0.01667$

Model 2:

With the number: $p_2 = 0.01020$.

This time, the model slightly favours model 1.

Why so slightly? Well, the data could have come from two coins – maybe the two coins are similar. Maybe it was the same coin selected twice. For this reason, it is always going to be easier to say that they are different than that they are the same.

Why would it ever favour model 1?

Given that we could always explain the data with two coins, why will the selection procedure ever favour one coin? The answer to this is one of the great benefits of the Bayesian paradigm. Within the Bayesian framework, there is an inbuilt notion of economy: a penalty associated with unnecessarily complex models.

We can see this very clearly here – look at the two plots for the second example – the heights of the highest bars are roughly the same: the best single coin performs about as well as the best pair of coins.  But, the average single coin is better than the average pair of coins.  This is because the total number of possibilities (9 -> 81) has increased faster than the total number of good possibilites. The average goodness has decreased.

Model 2 is more complex than model 1 – it has more things that we can change (2 coins rather than 1). In the Bayesian world, this increase in complexity has to be justified by much better explanatory power. In example 2, this is not the case.

In example 1 we don’t quite see the reverse effect. Now, model 1 had no good possibilities (no single coin could have realistically generated the data) whereas model 2 had some. Hence, model 2 was overwhelmingly more likely than model 1.

Summary

This is a simple example. It also doesn’t explain all of the important features of Bayesian model selection. However, I think it nicely shows the penalty that Bayesian model selection naturally imposes on overly complex models. There are other benefits to this approach that I haven’t discussed. For example, this doesn’t show the way in which $p_1$ and $p_2$ change as more data becomes available.

Relating back to Bayes rule

The final piece of the jigsaw is relating what we’ve done above to the equations that might accompany other examples of Bayesian model selection.  Bayes rule tells us that for model 1:

$Pr(c_n|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)Pr(c_n)}{\sum_{o=1}^9 r(n_1|N_1,c_o)Pr(n_2|N_2,c_o)Pr(c_o)}$

and for model 2:

$Pr(c_n,c_m|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)Pr(c_n)Pr(c_m)}{\sum_{o=1}^9 \sum_{q=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_q)Pr(c_o)Pr(c_q)}$
for model 2.

In both cases, we computed the denominator of the fraction – known as the evidence. In general, it may consist of an integral rather than a sum but the idea is the same: averaging over all parameters (coins or combinations of coins) to get a single measure of goodness for the model.