Scottish Parliament Visualisation

I’ve spent a little time in the last 24 hours (with much help from John Williamson) scraping MSP division (vote) data from the Scottish Parliament website. Divisions are held when the MSPs do not agree on a particular motion being put before them. Each MSP votes for or against the motion, and the majority wins. The division data is held within the official parliamentary records (essentially a dump of everything said during the day). Here’s an example containing several divisions:

http://www.scottish.parliament.uk/parliamentarybusiness/28862.aspx?r=8256&mode=html

If you know any html, take a look at the source, and marvel at how poorly structured it is (how about defining a div or span class for a division?? Or, as a start, how about closing your divs??). Also note the opaque numbering system (change the number after ‘r=’ in the link to get other records, seemingly randomly ordered).

Ranting aside, I think we finally managed to extract all division data for the current parliament (since 5th May 2011). Assuming I haven’t messed it all up (I will check…), 133 different MSPs have voted in 580 divisions since then, casting a total of 66,310 votes. I’m happy to share the data or the Python used to scrape it if anyone else wants it.
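(If you want to roll your own scraper, the rough shape of the loop is something like the sketch below, using the requests library. This is only a minimal sketch: the record-number range is illustrative and the ‘division’ check is deliberately naive – the real parsing has to fight the unclosed divs by hand.)

```python
import requests

BASE = "http://www.scottish.parliament.uk/parliamentarybusiness/28862.aspx"

def fetch_record(record_id):
    """Fetch one official report page by its (seemingly random) r= number."""
    resp = requests.get(BASE, params={"r": record_id, "mode": "html"})
    resp.raise_for_status()
    return resp.text

# Naive filter: keep pages that mention a division at all.
for record_id in range(8200, 8300):          # illustrative range only
    try:
        page = fetch_record(record_id)
    except requests.HTTPError:
        continue
    if "division" in page.lower():
        print(record_id, "contains at least one division")
```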

The first thing I’ve done with the data is visualise the MSPs in two dimensions (apologies for the rubbish plot – clicking on it makes it slightly better, and labelling the MSPs and votes would probably be useful):

2Dvis

In this 2D world, each MSP is a point (coloured according to standard party convention; grey are independents), and each vote is a line that splits the MSPs who voted ‘for’ the motion from those who voted ‘against’. Given the large number of MSPs and votes, it’s incredibly unlikely that it will be possible to position all of the MSPs and votes such that every MSP is on the correct side of every vote. However, we can use some computational magic to position them in a manner that gets as many right as possible (in this case, 97.8% of the MSP-vote pairs are correct). Once they have been placed, the positions of the MSPs tell us something about how they vote. For example, they will be near other MSPs who have similar voting patterns, and far away from those with very different ones.
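(In case the 97.8% figure seems mysterious, it’s just the proportion of MSP-vote pairs that end up on the correct side of their vote line. A minimal sketch of that scoring step is below – the variable names and random example data are mine, and the hard part, finding good positions and lines in the first place, is not shown.)

```python
import numpy as np

def fraction_correct(positions, lines, votes):
    """positions: (n_msps, 2) points; lines: (n_votes, 3) rows (a, b, c)
    defining a*x + b*y + c = 0; votes: (n_msps, n_votes) with +1 = for,
    -1 = against, 0 = didn't vote (ignored)."""
    # Signed distance of every MSP from every vote line.
    side = positions @ lines[:, :2].T + lines[:, 2]      # (n_msps, n_votes)
    cast = votes != 0
    correct = np.sign(side[cast]) == votes[cast]
    return correct.mean()

# Tiny random example just to show the bookkeeping.
rng = np.random.default_rng(0)
positions = rng.normal(size=(133, 2))
lines = rng.normal(size=(580, 3))
votes = rng.choice([-1, 0, 1], size=(133, 580))
print(fraction_correct(positions, lines, votes))  # ~0.5 for random data
```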

I haven’t had the chance to look at this in much detail yet, but the one thing that stands out is the homogeneity of the main parties (their members are very bunched together). In the Westminster parliament, a substantial minority of MPs regularly rebel against their parties, resulting in much broader clouds of points (e.g. here is a plot of Westminster MPs since 2010: 2010plain, and in the 2005 parliament: 2005plain). The most rebellious MSP in the Scottish Parliament is Christine Grahame (SNP) with 6 rebellions (defined as not voting with the party majority) out of 544 votes (~1%). Compare that with Labour MP Dennis Skinner who, up to the end of the 2005 Westminster Parliament, had rebelled a total of 273 times (source). In fact, there are so few that I can give you a list of all those MSPs who have rebelled more than once (rebellions, with total votes cast in brackets):

  • Grahame, Christine (SNP, Midlothian South, Tweeddale and Lauderdale) 6 (544)
  • Murray, Elaine (Lab, Dumfriesshire) 5 (529)
  • Gibson, Kenneth (SNP, Cunninghame North) 4 (569)
  • Malik, Hanzala (Lab, Glasgow) 4 (521)
  • Boyack, Sarah (Lab, Lothian) 3 (523)
  • Chisholm, Malcolm (Lab, Edinburgh Northern and Leith) 3 (569)
  • Cunningham, Roseanna (SNP, Perthshire South and Kinross-shire) 3 (536)
  • Henry, Hugh (Lab, Renfrewshire South) 3 (486)
  • Smith, Drew (Lab, Glasgow) 3 (517)
  • Wilson, John (SNP, Central Scotland) 3 (558)
  • McGrigor, Jamie (Con, Highlands and Islands) 3 (535)
  • Dugdale, Kezia (Lab, Lothian) 2 (545)
  • Eadie, Jim (SNP, Edinburgh Southern) 2 (564)
  • Ferguson, Patricia (Lab, Glasgow Maryhill and Springburn) 2 (492)
  • McLeod, Fiona (SNP, Strathkelvin and Bearsden) 2 (530)
  • Park, John (Lab, Mid Scotland and Fife) 2 (287)
  • Urquhart, Jean (SNP, Highlands and Islands) 2 (274)
  • Carlaw, Jackson (Con, West Scotland) 2 (484)
  • Scanlon, Mary (Con, Highlands and Islands) 2 (553)
  • Fraser, Murdo (Con, Mid Scotland and Fife) 2 (511)
  • Milne, Nanette (Con, North East Scotland) 2 (536)

I’ll leave it with you to decide whether that’s a good or bad thing.

(Note 1: my definition of rebellion is not great – in Westminster, a vote is only rebellious if it is whipped (i.e. the party declare that all MPs have to vote a certain way) and not all votes are. Some of those above may have been on un-whipped votes.)
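(For the record, the counts above come from a calculation roughly like the sketch below – a rebellion is counted whenever an MSP votes against the majority of those party colleagues who voted in that division. The array layout and names are mine.)

```python
import numpy as np

def count_rebellions(votes, party):
    """votes: (n_msps, n_votes) array with +1 = for, -1 = against, 0 = no vote.
    party: length n_msps array of party labels.
    A rebellion = voting against the majority of your own party in that division."""
    rebellions = np.zeros(len(party), dtype=int)
    for j in range(votes.shape[1]):
        for p in np.unique(party):
            members = (party == p)
            cast = members & (votes[:, j] != 0)
            if cast.sum() == 0:
                continue
            majority = np.sign(votes[cast, j].sum())
            if majority == 0:          # tied party vote: no majority to rebel against
                continue
            rebels = cast & (votes[:, j] == -majority)
            rebellions[rebels] += 1
    return rebellions

# Tiny example: 5 MSPs, 2 parties, 3 divisions.
votes = np.array([[ 1,  1, -1],
                  [ 1, -1, -1],
                  [ 1,  1, -1],
                  [-1, -1,  1],
                  [-1, -1,  1]])
party = np.array(["A", "A", "A", "B", "B"])
print(count_rebellions(votes, party))   # the second MSP rebels once (in the second division)
```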

Two MSPs appear on the plot above twice – John Finnie and Jean Urquhart. They both resigned from the SNP in late October 2012 and so appear once for their votes as members of the SNP and once as independent MSPs. In fact, their independent incarnations can be seen as the two grey circles south-west of the main chunk of SNP MSPs (yellow).

A final observation is how much voting is done. In the Westminster parliament attendance at votes is highly variable. In Holyrood it seems much more consistent. The median number of votes cast by MSPs is 530 (out of 580; 91%) and the 25th and 75th percentiles are 492 (85%) and 558 votes (96%) respectively.

Note 2: An added complication with the Scottish Parliament is that MSPs can actively abstain from a vote. The model deals with this by defining an exclusion zone around the line, within which only abstainers can be placed. Below are some examples of individual votes with the MSPs now coloured according to how they voted (red = against, green = abstain, blue = for, grey = didn’t vote). The solid line is the vote line, and the dashed ones show the exclusion zone. In these examples (as with most), the visualisation has done a decent job of separating the fors and againsts (reds and blues), with predominantly only the abstains and non-voters appearing in the exclusion zone.

eg1

eg2

eg3
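(The exclusion zone is just a margin either side of the vote line: a point only counts as ‘for’ or ‘against’ if it sits far enough from the line, otherwise it falls into the abstain zone. A toy version of that classification, with my own made-up function and numbers:)

```python
def classify(point, line, margin):
    """line = (a, b, c) for a*x + b*y + c = 0; returns 'for', 'against' or 'abstain'."""
    a, b, c = line
    signed = a * point[0] + b * point[1] + c
    if abs(signed) <= margin:
        return "abstain"          # inside the dashed exclusion zone
    return "for" if signed > 0 else "against"

print(classify((1.0, 0.5), (1.0, -1.0, 0.0), margin=0.2))  # 'for'
```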

Can you solve this probability question?

A colleague asked me the following probability question:

Suppose I have N empty boxes and in each one I can place a 0 or a 1. I randomly choose n boxes in which to place a 1. If I start from the first box, what is the probability that the first 1 I find will be in the mth box?

This is depicted in the following graphic:

blah

and we’re interested in the probability that the first 1 (reading from the left) is in, say, the third box.

I’m sure there must be a really nice form to the solution, but I can’t come up with one. The best I can do is the following slightly clunky one – any better ideas?

The probability of the first 1 being in the mth box is equal to the number of configurations where the first 1 is in the mth box, divided by the total number of configurations. The total number of configurations is given by:

\left(\begin{array}{c}N\\n\end{array}\right) = \frac{N!}{n!(N-n)!}

and the number of configurations that have their first 1 in the mth box is equal to the number of configurations that start with (m-1) 0s followed by a 1. This is the same as the number of ways of placing the remaining (n-1) 1s in the (N-m) boxes after the first 1:

\left(\begin{array}{c}N-m\\n-1\end{array}\right) = \frac{(N-m)!}{(n-1)!(N-m-(n-1))!}

So, the probability of the mth being the first 1 is:

P(m) = \frac{\left(\begin{array}{c}N-m\\n-1\end{array}\right)}{\left(\begin{array}{c}N\\n\end{array}\right)}
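(If you want to play with it, the clunky formula is easy to evaluate directly – a few lines of Python, function name mine:)

```python
from math import comb

def p_first_one(m, N, n):
    """Probability that the first 1 is in box m, with n 1s placed in N boxes."""
    return comb(N - m, n - 1) / comb(N, n)

N, n = 100, 5
probs = [p_first_one(m, N, n) for m in range(1, N - n + 2)]
print(probs[:3])          # P(1), P(2), P(3)
print(sum(probs))         # should be 1.0 (up to floating point)
```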

Here’s what it looks like for a few different values of n, when N=100:

n=2:
n2N100

n=5:
n5N100

n=20:
n20N100

which make sense – the more 1s you have, the more likely you’ll get one early. So, can anyone produce a neater solution? I’m sure it’ll just involve transforming the problem slightly.

Footnote: because the first 1 has to occur somewhere between the 1st and (N-n+1)th position,
\sum_{m=1}^{N-n+1} P(m) = 1
and therefore:
\left(\begin{array}{c}N\\n\end{array}\right) = \sum_{m=1}^{N-n+1}\left(\begin{array}{c}N-m\\n-1\end{array}\right),
which seems surprising to me, although I don’t know why.

Footnote2: Here’s another way of computing it, still a bit clumsy. The problem is the same as if we had N balls in a bag, n of which are red and (N-n) of which are black. If we start pulling balls out of the bag (and not replacing them), the probability that the first red one appears on the mth draw is equal to the probability of drawing m-1 black ones and then drawing a red one. The first probability can be computed from the hypergeometric distribution as:
\frac{\left(\begin{array}{c}n\\ 0 \end{array}\right) \left(\begin{array}{c}N-n\\ m-1-0 \end{array}\right)}{\left(\begin{array}{c}N\\ m-1 \end{array}\right)}
and the second probability is just equal to the probability of picking one of the n reds from the remaining N-(m-1) balls:
\frac{n}{N-(m-1)}
So the full probability is:
P(m) = \frac{n}{N-(m-1)} \times \frac{\left(\begin{array}{c}n\\ 0 \end{array}\right) \left(\begin{array}{c}N-n\\ m-1-0 \end{array}\right)}{\left(\begin{array}{c}N\\ m-1 \end{array}\right)}
which is arguably messier than the previous one.
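(A quick numerical check – my own few lines – confirms that the two expressions agree:)

```python
from math import comb

def p_binomial_form(m, N, n):
    return comb(N - m, n - 1) / comb(N, n)

def p_drawing_form(m, N, n):
    # probability of drawing m-1 blacks first, then a red
    first = comb(N - n, m - 1) / comb(N, m - 1)
    return first * n / (N - (m - 1))

N, n = 100, 5
assert all(abs(p_binomial_form(m, N, n) - p_drawing_form(m, N, n)) < 1e-12
           for m in range(1, N - n + 2))
```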

Bad Science. Literally.

The following article recently appeared on the Website of ‘Science’, one of the premier scientific journals:
Who’s Afraid of Peer Review?
The issue was also picked up by the Guardian newspaper:
Hundreds of open access journals accept fake science paper

The articles describe how a hoax paper submitted to several hundred Open Access journals (journals whose articles are available free of charge to anyone) was accepted by a large number of them, sometimes without any peer review. This is obviously bad.

Although the article raises some important issues, it seems wholly unfair to label them as Open Access issues. I’m confident that there are plenty of shifty journals out there (an unfortunate side-effect of the (also unfortunate) increasing pressure to publish more and more), some of which are OA and some of which are not. To make claims about OA, the Science hoax would also have needed to submit the paper to a large number of non-OA journals and compare the acceptance rates. In fact, it’s very surprising that Science published a study that was so, well, unscientific. Perhaps it’s a double hoax?

Science, by the way, is not OA. If you don’t work in a University that has a subscription, it’ll cost you $150 per year to read the research that is mainly funded through public funds. The current OA system isn’t perfect but it feels like things are moving in the right direction.

If you want more, the hoax has already been picked apart by e.g.: Open access publishing hoax: what Science magazine got wrong, and I’m sure there will be others.

Visualising touch errors

Last year (I’ve been meaning to write this for a while), Daniel Buschek spent a few months in the IDI group as an intern. His work here resulted in an interesting study looking at the variability of touch performance on mobile phones between users and between devices. Basically, we were interested in seeing if an algorithm trained to improve a user’s touch performance on one phone could be translated to another phone. To find the answer, you’ll have to read the paper…

During the data collection phase of the project, Daniel produced some little smartphone apps for visualising touch performance. For example, on a touch-screen smartphone navigate to:

http://dcs.gla.ac.uk/taps/demos/tapping2/

Hold your phone in one hand and repeatedly touch the cross-hair targets with the thumb of the hand you’re holding the phone with. Once you’ve touched 50 targets the screen will change. If you’re going to do it, do it now before you read on!


Ok, so now you’ll see something that looks a bit like this:
photo

This is a visualisation of touch accuracy (or lack thereof) – the lines moving to the right (from green to red) demonstrate that I typically touch some distance to the right of where I’m aiming (this is common for holding the phone with the right hand and using the right thumb).

Here’s how it works: at the start, the app chooses 5 targets (the green points). These are presented to you in a random order. When you touch, the app records where you touched and replaces the original target location with the position where you actually touched, i.e. there are still five targets, but one of them has moved. Once you’ve seen (and moved) all five targets, it loops through them again showing you the new targets and again replacing with the new touch. Because most people have a fairly consistent error between where they are aiming and where they actually touch, plotting all of the targets results in a gradual drift in one direction, finishing (after 50 touches, 10 for each trace) with the red points. Note that not all people have this consistency (something discussed in Daniel’s paper).
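(In rough Python, the update loop looks something like the sketch below – this is just the logic as described above, not Daniel’s actual app code, and get_touch is a stand-in that fakes a user with a consistent rightwards offset.)

```python
import random

def get_touch(target, offset=(12, 4), noise=3.0):
    """Stand-in for the user's tap: the aimed-at point plus a consistent offset."""
    return (target[0] + offset[0] + random.gauss(0, noise),
            target[1] + offset[1] + random.gauss(0, noise))

targets = [(50, 100), (150, 300), (250, 200), (100, 400), (200, 500)]
traces = [[t] for t in targets]

for _ in range(10):                       # 10 loops x 5 targets = 50 touches
    for i in random.sample(range(5), 5):  # show the 5 current targets in random order
        touch = get_touch(targets[i])
        targets[i] = touch                # replace the target with the touch point
        traces[i].append(touch)
# Plotting each trace from green (start) to red (end) gives the drifting lines above.
```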

To demonstrate how the errors vary, here’s another one I did, but this time with the left hand where the offset is now in a completely different direction:

photo

One particularly odd feature (for me anyway) is that I know that I’m touching with errors, but don’t seem to be able to correct it! The fact that the errors are so different for left-hand and right-hand use points to a problem in designing methods to correct for errors – we need to know which hand the user is using.

If you’d like to try one target rather than 5, use this link:
http://daniel-buschek.de/demos/tapping2/

If you’re interested in why there is an offset, then this paper by Christian Holz and Patrick Baudisch is a good starting point.

Running…

In the Machine Learning course I teach (designed with Mark Girolami), I use Olympic 100m winning times to introduce the idea of making predictions from data (via linear regression). Here’s the data, with the linear regression line projected into the future. Black is men, red women (dashed lines represent the plus and minus three standard deviation region):
menvwomen0
There are two reasons for using this data. The first is that it’s nice and visual and something students can relate to. The second is that it opens up a discussion about some of the bad things you can do when building predictive models – in particular, what happens if we project further into the future. Here’s the same plot up to 2200:
menvwomen
Because women have historically been getting faster more quickly, the two lines cross at the 2156 Olympics. At about this point in the lectures a vocal student will pipe up and point out that assuming the trend continues this far into the future looks a bit iffy. And if we assume a linear trend indefinitely, we eventually get to the day where the 100m is won in a negative number of seconds:
menvwomen2
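(The mechanics of the crossing-point calculation are just a couple of calls to np.polyfit – the times below are made-up placeholders purely so the snippet runs; swap in the real winning times.)

```python
import numpy as np

years = np.arange(1948, 2012, 4)
# Made-up placeholder times purely so the snippet runs -- substitute the real data.
rng = np.random.default_rng(1)
times_m = 10.5 - 0.011 * (years - 1948) + rng.normal(0, 0.05, len(years))
times_w = 11.6 - 0.017 * (years - 1948) + rng.normal(0, 0.05, len(years))

am, bm = np.polyfit(years, times_m, 1)   # winning time ~ a*year + b
aw, bw = np.polyfit(years, times_w, 1)

crossing = (bm - bw) / (aw - am)         # year where the two fitted lines meet
zero_m = -bm / am                        # year the men's line hits 0 seconds(!)
print(round(crossing), round(zero_m))
```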
Hopefully you’ll agree that we should be fairly sceptical of any analysis that says this could happen. So I was quite surprised to discover when reading David Epstein’s The Sports Gene that academics had done exactly this. And published it in Nature. NATURE! Arguably the top scientific journal IN THE WORLD. If you don’t believe me, here it is:
Momentous sprint at the 2156 Olympics.

Rant: Politicians and YouGov polls

Louise Mensch is the Conservative MP for Corby.  Today she tweeted this:

RT @RicHolden: #YouGov polling numbers for female voters: #Conservatives 43%; #Labour 41%; LD 9%; Oth 7%. #feminism #VoteConservative

There is not a lot of difference between 41% and 43%.  The polling companies can’t feasibly ask every woman in the UK.  If you only ask a subset then small differences might just be because you happened to ask an unrepresentative sample.

Here is the actual poll data:

http://cdn.yougov.com/cumulus_uploads/document/rgjtpnh1ug/YG-Archives-Pol-Sun-results-090112.pdf

which says that they asked 1727 people.  That’s a reasonable number of people.  If you are statistically minded, you could work out whether 1727 people are enough to draw any conclusions about a 2% difference.
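(For the lazy-but-curious, here’s the back-of-envelope version, in a few lines of my own. Note also that the female-voter figures come from only a subset of the 1727 respondents – roughly half – which widens the margin further.)

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a sample of size n."""
    return z * sqrt(p * (1 - p) / n)

print(margin_of_error(0.43, 1727))   # ~ +/- 0.023, i.e. about 2.3 points
print(margin_of_error(0.43, 860))    # ~ +/- 0.033 for a half-sized subsample
```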

If you’re not statistically minded (or are, but are lazy), you can instead look at the same survey done just 3 days earlier:

http://cdn.yougov.com/cumulus_uploads/document/d51j2t3jzl/YG-Archives-Pol-ST-results-06-080112.pdf

and discover that three days ago, it was Lab 42%, Con 37%!

I can think of two reasons for this:

1: Women have suddenly (IN THE LAST 3 DAYS) started a mass Labour exodus (Ed Miliband has been pretty useless).

2: Polling a small number of people isn’t very reliable and making a big deal about it when it favours your particular cause makes you look a bit silly.

I’m inclined to go for the latter, although it would be nice to think that female support for UKIP is a fifth of what it was 3 days ago.  And that the 1% of males who would have voted for the BNP on the 6th Jan have since been culled/educated.

Two interesting features do stand out from the data:

1. Women are consistently more likely to give the answer ‘don’t know’ than men.  I don’t believe that women really ‘don’t know’ more (or that men ‘know’ more) so did a quick search to see if this is something that has been studied.  Could only find this (didn’t look for long), which looks at it for the quite specific question “do you snore” (!!): Correlates of the “don’t know” response to questions about snoring.

2. People become much less likely to say ‘don’t know’ as they get older.  Actually, I know some old people and this isn’t surprising.

Visualising parliamentary data…again

In a previous post, I described a very general binary PCA algorithm for visualising voting data from the House of Commons.

To recap, we start with voting data which tells us how each of the 600 or so MPs voted in each of the 1400 or so votes of each parliament. An MP can vote for a particular motion or against it, or may not vote at all. The algorithm converts this into a two-dimensional plot where each MP is represented by a point. MPs close together have similar voting patterns, and those far apart don’t. See this for a better description.

I’ve now extended the algorithm a bit – not time to go into details now – but the outcome is that it produces a visualisation that incorporates a degree of uncertainty in the location of the MPs. This uncertainty can come from one of two sources:
1. Lack of data: if MPs don’t vote very much, we can’t be sure about where they should be plotted (a nice example is the three deputy speakers in 2010plain – big ellipses near the centre).
2. Lack of conformity: it might be hard to place some MPs in two dimensions in such a way that most of their votes are modelled correctly.

The following plots show the results for the 1997, 2001, 2005 and 2010 (up to about May 2010) parliaments. Each ellipse is an MP, and they’ve been coloured according to their party (main parties are obvious, key at the bottom for the others). The ellipses roughly represent where the MP might lie — the bigger the ellipse, the less sure we are about the location.

The lines on the plots represent the votes. MPs on one side of the line voted for the motion and on the other against. It would be nice to label some of the votes, maybe I’ll do this soon.

Anyway, here are the files. They are a bit messy, and I’ve labeled some of the MPs who I thought might be interesting. Happy to label any others if anyone is interested.

1997plain
2001plain
2005plain
2010plain

Some things that I think are interesting:
1. Clare Short before and after (Clare Short2) resigning (2005plain).
2. Ken Livingstone before and after (Ken Livingstone 2) resigning (1997plain).
3. How much the nationalist parties who were previously close to the Conservatives and Lib Dems (e.g. DUP and SNP) have now deserted them (Compare everything with 2010plain)
4. How close Clegg and Cameron are (2010plain) (couldn’t find a font small enough to separate them)
5. There appear to be more rebellious Conservatives in the coalition (2010plain) than Lib Dems (more lines (votes) cut through the blue ellipses than the yellow ones).
6. In 1997, the Lib Dems seemed to vote almost equally with Labour and with the Conservatives (roughly the same number of lines splitting Lib&Lab from Con as there are splitting Lib&Con from Lab). In 2001 and 2005, the Lib Dems seem more aligned with the Conservatives than Labour (more lines splitting Lab from Lib&Con than splitting Lib&Lab from Con).

Key:
Red: Labour
Blue: Conservative
Yellow: Lib Dem
Magenta: DUP
Orange: Plaid Cymru
Weird pinky-orange colour: Scottish National Party (normally found next to Plaid Cymru)
Green: All the rest

Was Murray unlucky?

Andy Murray played Rafael Nadal today in the semi-final of the ATP World Tour Finals at the O2 Arena in London. Nadal won a great match, 7-6 (7-5), 3-6, 7-6.

Here are the match stats (couldn’t work out how to generate a permanent link to this):

Murray won more points (5 more than Nadal), but still lost the match.

The other numbers near the bottom of the table tell us how many points each player won when serving and when receiving. Murray served 114 points of which he won 78; Nadal served 109 of which he won 73.

I’ve been considering building a statistical model for a tennis match for a while. A model is a mathematical representation of something in the world (e.g. a tennis match). With a good model, we can learn something about the match that perhaps isn’t immediately obvious and maybe predict what might happen in the future.

The number of points won when serving/receiving could form the basis of a simple model of a tennis match. Murray won 68.4% of the points when he served, and 33.0% of the points when Nadal was serving. Using this information, and a knowledge of the tennis scoring system, it is possible to generate potential matches: for each point, we flip one of two dodgy coins (depending on who is serving) and if it lands heads, we award the point to Murray. The first coin should be designed to land heads 68.4% of the time, the second 33.0% of the time. This would be time-consuming, but fortunately, it is the kind of thing that is very fast on a computer.
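(Here’s a rough sketch of that simulation. It uses a simplified version of the scoring rules – standard games and tiebreaks, best of three sets, and no attempt to track who serves first in each set – so treat it as illustrative rather than definitive.)

```python
import random

def sim_game(p):
    """One service game; p = probability the server wins each point.
    Returns True if the server holds serve (first to 4 points, two clear)."""
    s, r = 0, 0
    while True:
        if random.random() < p:
            s += 1
        else:
            r += 1
        if s >= 4 and s - r >= 2:
            return True
        if r >= 4 and r - s >= 2:
            return False

def sim_tiebreak(p_serve, p_receive):
    """First to 7, two clear. Murray serves point 1, then the serve alternates every two points."""
    a, b = 0, 0
    while True:
        murray_serving = (a + b) % 4 in (0, 3)
        p = p_serve if murray_serving else p_receive
        if random.random() < p:
            a += 1
        else:
            b += 1
        if a >= 7 and a - b >= 2:
            return True
        if b >= 7 and b - a >= 2:
            return False

def sim_set(p_serve, p_receive):
    """First to 6 games, two clear, tiebreak at 6-6. Returns True if Murray wins the set."""
    a, b = 0, 0
    murray_to_serve = True
    while True:
        if murray_to_serve:
            held = sim_game(p_serve)
            a, b = a + held, b + (not held)
        else:
            held = sim_game(1 - p_receive)   # Nadal serving
            b, a = b + held, a + (not held)
        murray_to_serve = not murray_to_serve
        if (a >= 6 or b >= 6) and abs(a - b) >= 2:
            return a > b
        if a == 6 and b == 6:
            return sim_tiebreak(p_serve, p_receive)

def sim_match(p_serve=0.684, p_receive=0.330):
    """Best of three sets, from Murray's point of view."""
    sets_a = sets_b = 0
    while sets_a < 2 and sets_b < 2:
        if sim_set(p_serve, p_receive):
            sets_a += 1
        else:
            sets_b += 1
    return sets_a > sets_b

wins = sum(sim_match() for _ in range(10000))
print(wins / 10000)   # should land somewhere near the 57.5% quoted below
```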

I’ve generated 10,000 matches in this way, of which Murray won 5753 (57.5%). In other words, if this model is reasonably realistic (more on this later), then we would expect Murray to win more often than lose if they played at the same level (served and received as well as they did today) in the future.

We can look a bit deeper into the results and work out how likely different scorelines are:

Murray 2 – 0 Nadal: 30.3%
Murray 2 – 1 Nadal: 27.2%
Murray 1 – 2 Nadal: 22.3%
Murray 0 – 2 Nadal: 20.1%

Another interesting stat is how many points we’d expect to see. In the following plot, we can see the number of points in each of the 10,000 simulated matches (the quality is a bit low – click the image to see a better version):

The red line shows the number in the actual match: 223. Only 1.6% of the simulated matches involved 223 or more points. So, if we’re happy that our model is realistic, the match was surprisingly long.

How realistic is the model? Not very. It’s based on the assumption that all points are independent of one another. In other words, the outcome of a particular point doesn’t depend on what’s already happened – an unrealistic assumption – players are affected by previous points. It also assumes that the chance of say, Murray winning a point when he is serving is constant throughout the match. This is also unrealistic – players get tired. These caveats don’t necessarily mean that the model is useless, but they should feature prominently in any conclusions.

One way to make it more realistic would be to use the stats from the three different sets when performing the simulation (a sketch of this follows the numbers below). The stats for the three sets (from Murray’s point of view) are:
Set 1: Serving 28/36 (77.8%), Receiving 7/36 (19.4%)
Set 2: Serving 18/25 (72.0%), Receiving 14/30 (46.7%)
Set 3: Serving 32/53 (60.4%), Receiving 15/43 (34.9%)
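(Building on the sim_set sketch above, the per-set version just feeds each set its own pair of numbers:)

```python
set_stats = [(0.778, 0.194), (0.720, 0.467), (0.604, 0.349)]   # (serving, receiving) per set

def sim_match_per_set(stats):
    sets_a = sets_b = 0
    for p_serve, p_receive in stats:
        if sim_set(p_serve, p_receive):
            sets_a += 1
        else:
            sets_b += 1
        if sets_a == 2 or sets_b == 2:
            break
    return sets_a > sets_b

wins = sum(sim_match_per_set(set_stats) for _ in range(10000))
print(wins / 10000)   # should come out near the 59.3% quoted below
```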

Simulating 10,000 results with this enhanced model results in Murray winning 5932 matches (59.3%). This is slightly higher than the previous result but we’d need to simulate more matches to be sure it’s not just a bit higher by chance.

Here is the breakdown of the possible match scores:

Murray 2 – 0 Nadal: 40.2%
Murray 2 – 1 Nadal: 19.1%
Murray 1 – 2 Nadal: 37.5%
Murray 0 – 2 Nadal: 3.2%

This is very different to the previous breakdown. Scores of 2-0 and 1-2 have both become more likely and a Nadal 2-0 win looks very unlikely. We can see why if we look more closely at who wins each set. The first set is pretty close: Murray wins it 42.6% of the time. The second set is incredibly one-sided with Murray winning it 94.5% of the time. If the match goes to a third set, Murray wins it 34.1% of the time.

Interestingly, if we look at the number of points again, it’s now even less likely to see 223 or more points. Only 54 matches (0.5%) were this long or longer. I didn’t expect this – my gut feeling as to why the matches in the first model were on the whole shorter than the real one was that the model wasn’t much good. However, this new model is more realistic, and the lengths have got shorter!

One possible conclusion to the whole analysis is that the scoreline was a bit flattering to Murray – he nearly won the third set although he didn’t play particularly well in it (he could expect to win it only 34.1% of the time). To illustrate how much better Nadal played in the final set we can simulate matches using just the stats from the last set. This results in a very one-sided picture with Nadal winning 7132 (71.3%)!

Obviously the earlier criticisms of the model still apply – we’re now assuming that within a set, each player has a constant probability of winning points. It’s easy to argue that this is still insufficient. However, it certainly provides some food for thought.

Bayesian model selection for beginners

I’ve recently encountered the following problem (or at least, a problem that can be formulated like this, the actual thing didn’t involve coins):

Imagine you’ve been provided with the outcomes of two sets of coin tosses.

First set: N_1 tosses, n_1 heads

Second set: N_2 tosses, n_2 heads

And we want to know whether the same coin was used in both cases.  Note that these ‘coins’ might be very biased (always heads, always tails) – we don’t know.

There are classical tests for doing this, but I wanted to do a Bayesian test.  This in itself is not much of a challenge (it’s been done many times before) but what I think is interesting is how to explain the procedure to someone who has no familiarity with the Bayesian approach.  Maybe if that is possible, more people would do it.  And that would be a good thing.  Here’s what I’ve come up with:

We’ll first assume a slight simplification of the problem.  We’ll assume that there are 9 coins in a bag and the person who did the coin tossing either:

Pulled one coin out of the bag and did N_1 tosses followed by N_2 tosses

OR

Pulled one coin out of the bag, did N_1 tosses, put the coin back, pulled out another coin (could have been the same one) and did N_2 tosses with this one.

The coins in the bag all have different probabilities of landing heads.  The probabilities for the 9 coins are: 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9.  We’ll label these coins c_1 to c_9.

Model 1

We’ll start by building a model for each scenario.  The first model assumes that both sets of coin toss data were produced by the same coin.  If we knew which coin it was, the probability of seeing the data we’ve seen would be:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

i.e. the probability of seeing n_1 heads in the first N_1 tosses of coin c_n multiplied by the probability of seeing n_2 heads in the second N_2 tosses of the same coin c_n.

In reality we don’t know which coin it was. In this case, it seems sensible to calculate the average probability: the total of all of the probabilities divided by the number of coins. This is computed as:

p_1=\sum_{n=1}^9 \frac{1}{9} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

Another way of thinking about this is as a weighted sum of the probabilities where each weight is the probability of that coin being chosen. In this case, each coin was equally likely and so has probability 1/9. In general this doesn’t have to be the case.

Note that we haven’t defined what Pr(n_1|N_1,c_n) is (in this example it’s just the binomial distribution). This isn’t particularly important – it’s just a number that tells us how likely we are to get a particular number of heads with a particular coin.

Model 2
The second model corresponds to picking a coin randomly for each set of coin tosses. The probability of the data in this model, if we knew that the coins were c_n and c_m is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

Again, we don’t know c_n or c_m so we average over them both. In other words, we average over every combination of coin pairs. There are 81 such pairs and each pair is equally likely:

p_2 = \sum_{n=1}^9 \sum_{m=1}^9 \frac{1}{81} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)
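(Both averages are only a couple of lines of code, assuming Pr(n|N,c) is the binomial distribution as noted above; function names are mine.)

```python
from math import comb

coins = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def binom_pmf(k, n, p):
    """Pr(k heads | n tosses of a coin with heads probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def evidences(N1, n1, N2, n2):
    """Return (p_1, p_2): the same-coin and different-coin model evidences."""
    p1 = sum(binom_pmf(n1, N1, c) * binom_pmf(n2, N2, c) for c in coins) / 9
    p2 = sum(binom_pmf(n1, N1, cn) * binom_pmf(n2, N2, cm)
             for cn in coins for cm in coins) / 81
    return p1, p2

print(evidences(10, 9, 10, 1))   # roughly (0.000029, 0.008478) -- example 1 below
print(evidences(10, 6, 10, 5))   # roughly (0.0167, 0.0102)     -- example 2 below
```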

Comparing the models

We can now compare the numbers p_1 and p_2. If one is much higher than the other, that model is most likely. Let’s try an example…

Example 1
Here’s the data:

N_1=10,n_1=9,N_2=10,n_2=1

So, 9 heads in the first set of tosses and only 1 in the second. At first glance, it looks like a different coin was used in both cases. Let’s see if our calculations agree. Firstly, here is Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n) plotted on a graph where each bar is a different c_n.

The coin that looks most likely is the one with c_n=0.5. But, look at the scale of the y-axis. All of the values are very low – none of them look like they could have created this data. It’s hard to imagine taking any coin, throwing 9 heads in the first ten and then 1 head in the second ten.

To compute p_1 we add up the total height of all of the bars and divide the whole thing by 9: p_1=0.000029.

We can make a similar plot for model 2:

The plot is now 3D, because we need to look at each combination of c_n and c_m. The height of each bar is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

The bars are only high when c_n is high and c_m is low. But look at the height of the high bars – much much higher than those in the previous plot. Some combinations of two coins could conceivably have generated this data. To compute p_2 we need to add up the height of all the bars and divide the whole thing by 81: p_2 = 0.008478.

This is a small number, but almost 300 times bigger than p_1. The model selection procedure has backed up our initial hunch.

Example 2

N_1=10,n_1=6,N_2=10,n_2=5

i.e. 6 heads in the first 10, 5 heads in the second 10. This time it looks a bit like it might be the same coin. Here’s the plot for model 1:

And here’s the magic number: p_1 = 0.01667

Model 2:

With the number: p_2 = 0.01020.

This time, the comparison slightly favours model 1.

Why so slightly? Well, the data could have come from two coins – maybe the two coins are similar. Maybe it was the same coin selected twice. For this reason, it is always going to be easier to say that they are different than that they are the same.

Why would it ever favour model 1?

Given that we could always explain the data with two coins, why will the selection procedure ever favour one coin? The answer to this is one of the great benefits of the Bayesian paradigm. Within the Bayesian framework, there is an inbuilt notion of economy: a penalty associated with unnecessarily complex models.

We can see this very clearly here – look at the two plots for the second example – the heights of the highest bars are roughly the same: the best single coin performs about as well as the best pair of coins.  But, the average single coin is better than the average pair of coins.  This is because the total number of possibilities (9 -> 81) has increased faster than the total number of good possibilities. The average goodness has decreased.

Model 2 is more complex than model 1 – it has more things that we can change (2 coins rather than 1). In the Bayesian world, this increase in complexity has to be justified by much better explanatory power. In example 2, this is not the case.

In example 1 the opposite happened: model 1 had no good possibilities (no single coin could realistically have generated the data) whereas model 2 had some. Hence, model 2 was overwhelmingly more likely than model 1, despite its extra complexity.

Summary

This is a simple example. It also doesn’t explain all of the important features of Bayesian model selection. However, I think it nicely shows the penalty that Bayesian model selection naturally imposes on overly complex models. There are other benefits to this approach that I haven’t discussed. For example, this doesn’t show the way in which p_1 and p_2 change as more data becomes available.

Relating back to Bayes rule

The final piece of the jigsaw is relating what we’ve done above to the equations that might accompany other examples of Bayesian model selection.  Bayes rule tells us that for model 1:

Pr(c_n|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)Pr(c_n)}{\sum_{o=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_o)Pr(c_o)}

and for model 2:

Pr(c_n,c_m|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)Pr(c_n)Pr(c_m)}{\sum_{o=1}^9 \sum_{q=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_q)Pr(c_o)Pr(c_q)}

In both cases, we computed the denominator of the fraction – known as the evidence. In general, it may consist of an integral rather than a sum but the idea is the same: averaging over all parameters (coins or combinations of coins) to get a single measure of goodness for the model.

How complex is the parliament?

I’ve finally got around to doing some analysis on the parliamentary vote data using a model that can handle both binary data (the votes) and missing data (MPs not voting).

Intuitively, it’s quite easy to imagine how it works:

Assume we have a load of MPs and one vote, which they all attended. They can only vote yes or no (1 or 0). It is clearly possible to get all of the MPs to stand in a line such that all of them to the left of some point voted yes (1) and all to the right voted no (0). We could then characterise each MP by their position in the line and from that could work out how they voted. We’d end up with something that looks like this (each digit is an MP):

000000000111111111

Seem pointless? Yep, it would be. But what if there were two votes? If the MPs voted the same in both votes, it would be easy – we’d still only need one line, and one reference point. It would look like this (each MP is now a column, each row a vote):

V1: 000000011111111
V2: 000000011111111

If they voted completely oppositely, it would still be possible:

V1: 000000011111111
V2: 111111100000000

We can still use a single line if they vote a bit differently in two votes but two reference points will be required:

V1: 00000000111111
V2: 00001111111111

For vote 2, the reference point is slightly to the left of that for vote 1.

It’s just as easy to dream up voting patterns for which we can’t do this. For example, we can’t reorder the following MPs such that for each vote there is a point to the left of which they all vote 0 and to the right 1:

V1: 00000001111111
V2: 00011110000011

What we can do, is position the MPs in 2-dimensions – imagine placing them in a room rather than on a line. For each vote, we’ll split the room in two with a straight line such that all MPs on one side of the line vote one way, and all on the other side vote the other way.

For two votes, we could always position the MPs in two dimensions in this way (much like we could with one vote and a line). We might be able to do 3 votes in 2-dimensions, but we can’t be sure – it depends on how they voted.

Given data for a full parliament (about 1600 votes), it’s interesting to see how well we can do this with a particular number of dimensions. For example, if we restrict ourselves to two dimensions, is it possible to lay the MPs out and draw a line for each vote such that each MP is on the correct side of each line? If it is not possible, what’s the highest number of MP-vote pairs that we can get right?

This might tell us something about the voting patterns of the parliament. For example, in the UK we have a 3 party system (or at least we have until recently). Let’s assume that for each vote in say the 1997 parliament, Labour MPs voted one way, Conservative MPs the other and the Lib Dems sided with one or other of them. If this were the case, we’d only need one dimension:

V1: 0000 1111 1111
V2: 0000 0000 1111
V3: 1111 1111 0000
V4: 1111 0000 0000

Each column is an MP, each block (set of four columns) a party (Labour, Lib Dem, Conservative, in that order). If we find for real data that we need only one dimension, it suggests this (or something similar) is happening.
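(To make the ‘one dimension’ question concrete: given an ordering of the MPs, a vote can be drawn on a single line only if, read in that order, it is all 0s then all 1s, or the reverse. Here’s a tiny check of that property for a fixed ordering – my own sketch, nothing to do with the actual fitting algorithm, which searches over positions rather than testing a given ordering.)

```python
def monotone(bits):
    """True if the sequence is all 0s then all 1s, or all 1s then all 0s."""
    s = "".join(str(b) for b in bits)
    return "10" not in s or "01" not in s

def unexplained_in_1d(order, votes):
    """order: list of MP indices (their positions on the line).
    votes: list of vote vectors, one entry per MP per vote.
    Returns the votes that cannot be explained by this ordering."""
    return [v for v in votes if not monotone([v[i] for i in order])]

votes = ["00000001111111", "00011110000011"]
print(unexplained_in_1d(list(range(14)), votes))   # the second vote fails, as claimed above
```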

Alternatively, if we assume that sometimes Labour and Conservatives vote together and the Lib Dems differently, we would need a second dimension.

The following plot shows the percentage of MP-vote pairs we can get right for different parliaments as we increase the number of dimensions:

(The line for 2010 should be treated with some caution – only 100 or so votes so far.)

To put the y-axis into perspective, in a parliament of 1600 votes and 600 MPs, an increase of 1% corresponds to getting approx 9000 additional votes correct – about 10 MPs’ worth if the MPs vote about 50% of the time.

The results suggest two things to me. Firstly, the voting patterns in each parliament are pretty simple – we can get a lot of the votes right with two dimensions. This is not surprising – most MPs will vote along party lines and we have three (main) parties. Secondly, it looks like the three successive Labour parliaments have been getting slightly more complex over time – 1997 seems to have the simplest structure, but not by much. Perhaps over time MPs became a bit more rebellious?