Bayesian model selection for beginners

I’ve recently encountered the following problem (or at least, a problem that can be formulated like this, the actual thing didn’t involve coins):

Imagine you’ve been provided with the outcomes of two sets of coin tosses.

First set: N_1 tosses, n_1 heads

Second set: N_2 tosses, n_2 heads

And we want to know whether the same coin was used in both cases.  Note that these ‘coins’ might be very biased (always heads, always tails), we don’t know.

There are classical tests for doing this, but I wanted to do a Bayesian test.  This in itself is not much of a challenge (it’s been done many times before) but what I think is interesting is how to explain the procedure to someone who has no familiarity with the Bayesian approach.  Maybe if that is possible, more people would do it.  And that would be a good thing.  Here’s what I’ve come up with:

We’ll first assume a slight simplification of the problem.  We’ll assume that there are 9 coins in a bag and the person who did the coin tossing either:

Pulled one coin out of the bag and did N_1 tosses followed by N_2 tosses

OR

Pulled one coin out of the bag, did N_1 tosses, put the coin back, pulled out another coin (could have been the same one) and did N_2 tosses with this one.

The coins in the bag all have different probabilities of landing heads.  The probabilities for the 9 coins are: 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9.  We’ll label these coins c_1 to c_9.

Model 1

We’ll start by building a model for each scenario.  The first model assumes that both sets of coin toss data were produced by the same coin.  If we knew which coin it was, the probability of seeing the data we’ve seen would be:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

i.e. the probability of seeing n_1 heads in the first N_1 tosses of coin c_n multiplied by the probability of seeing n_2 heads in the second N_2 tosses of coin c_2.

In reality we don’t know which coin it was. In this case, it seems sensible to calculate the averageprobability. The total of all of the probabilities divided by the number of coins. This is computed as:

p_1=\sum_{n=1}^9 \frac{1}{9} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

Another way of thinking about this is as a weighted sum of the probabilities where each weight is the probability of that coin being chosen. In this case, each coin was equally likely and so has probability 1/9. In general this doesn’t have to be the case.

Note that we haven’t defined what Pr(n_1|N_1,c_n) is. This isn’t particularly important – it’s just a number that tells us how likely we are to get a particular number of heads with a particular coin.

Model 2
The second model corresponds to picking a coin randomly for each set of coin tosses. The probability of the data in this model, if we knew that the coins were c_n and c_m is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

Again, we don’t know c_n or c_m so we average over them both. In other words, we average over every combination of coin pairs. There are 81 such pairs and each pair is equally likely:

p_2 = \sum_{n=1}^9 \sum_{m=1}^9 \frac{1}{81} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

Comparing the models

We can now compare the numbers p_1 and p_2. If one is much higher than the other, that model is most likely. Lets try an example…

Example 1
Here’s the data:

N_1=10,n_1=9,N_2=10,n_2=1

So, 9 heads in the first set of tosses and only 1 in the second. At first glance, it looks like a different coin was used in both cases. Lets see if our calculations agree. Firstly, here is Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n) plotted on a graph where each bar is a different c_n.

The coin that looks most likely is the one with c_n=0.5. But, look at the scale of the y-axis. All of the values are very low – none of them look like they could have created this data. It’s hard to imagine taking any coin, throwing 9 heads in the first ten and then 1 head in the second ten.

To compute p_1 we add up the total height of all of the bars and divide the whole thing by 9: p_1=0.000029.

We can make a similar plot for model 2:

The plot is now 3D, because we need to look at each combination of c_n and c_m. The height of each bar is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

The bars are only high when c_n is high and c_m is low. But look at the height of the high bars – much much higher than those in the previous plot. Some combinations of two coins could conceivable have generated this data. To compute p_2 we need to add up the height of all the bars and divide the whole thing by 81: p_2 = 0.008478

This is a small number, but almost 300 times bigger than p_1. The model selection procedure has backed up our initial hunch.

Example 2

N_1=10,n_1=6,N_2=10,n_2=5

i.e. 6 heads in the first 10, 5 heads in the second 10. This time it looks a bit like it might be the same coin. Here’s the plot for model 1:

And here’s the magic number: p_1 = 0.01667

Model 2:

With the number: p_2 = 0.01020.

This time, the model slightly favours model 1.

Why so slightly? Well, the data could have come from two coins – maybe the two coins are similar. Maybe it was the same coin selected twice. For this reason, it is always going to be easier to say that they are different than that they are the same.

Why would it ever favour model 1?

Given that we could always explain the data with two coins, why will the selection procedure ever favour one coin? The answer to this is one of the great benefits of the Bayesian paradigm. Within the Bayesian framework, there is an inbuilt notion of economy: a penalty associated with unnecessarily complex models.

We can see this very clearly here – look at the two plots for the second example – the heights of the highest bars are roughly the same: the best single coin performs about as well as the best pair of coins.  But, the average single coin is better than the average pair of coins.  This is because the total number of possibilities (9 -> 81) has increased faster than the total number of good possibilites. The average goodness has decreased.

Model 2 is more complex than model 1 – it has more things that we can change (2 coins rather than 1). In the Bayesian world, this increase in complexity has to be justified by much better explanatory power. In example 2, this is not the case.

In example 1 we don’t quite see the reverse effect. Now, model 1 had no good possibilities (no single coin could have realistically generated the data) whereas model 2 had some. Hence, model 2 was overwhelmingly more likely than model 1.

Summary

This is a simple example. It also doesn’t explain all of the important features of Bayesian model selection. However, I think it nicely shows the penalty that Bayesian model selection naturally imposes on overly complex models. There are other benefits to this approach that I haven’t discussed. For example, this doesn’t show the way in which p_1 and p_2 change as more data becomes available.

Relating back to Bayes rule

The final piece of the jigsaw is relating what we’ve done above to the equations that might accompany other examples of Bayesian model selection.  Bayes rule tells us that for model 1:

Pr(c_n|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)Pr(c_n)}{\sum_{o=1}^9 r(n_1|N_1,c_o)Pr(n_2|N_2,c_o)Pr(c_o)}

and for model 2:

Pr(c_n,c_m|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)Pr(c_n)Pr(c_m)}{\sum_{o=1}^9 \sum_{q=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_q)Pr(c_o)Pr(c_q)}
for model 2.

In both cases, we computed the denominator of the fraction – known as the evidence. In general, it may consist of an integral rather than a sum but the idea is the same: averaging over all parameters (coins or combinations of coins) to get a single measure of goodness for the model.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s