Scottish Parliament Visualisation

I’ve spent a little time in the last 24 hours (with much help from John Williamson) scraping MSP division (vote) data from the Scottish Parliament website. Divisions are held when the MSPs do not agree on a particular motion being put before them. Each MSP votes for or against the motion, and the majority wins. The division data is held within the official parliamentary records (essentially a dump of everything said during the day). Here’s an example containing several divisions:

http://www.scottish.parliament.uk/parliamentarybusiness/28862.aspx?r=8256&mode=html

If you know any HTML, take a look at the source and marvel at how poorly structured it is (how about defining a div or span class for a division?? Or, as a start, how about closing your divs??). Also note the opaque numbering system (change the number after ‘r=’ in the link to get other records, seemingly randomly ordered).

Ranting aside, I think we finally managed to extract all division data for the current parliament (since 5th May 2011). Assuming I haven’t messed it all up (I will check…), since then 133 different MSPs have voted in 580 divisions, casting a total of 66,310 votes. I’m happy to share the data or the Python used to scrape it if anyone else wants it.
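If it’s useful, here’s a minimal sketch of the scraping approach. The parsing details are messy and site-specific; the ‘For’/‘Against’/‘Abstentions’ headings and the libraries used here are assumptions rather than the exact code we used:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.scottish.parliament.uk/parliamentarybusiness/28862.aspx"

def fetch_report_text(r):
    """Fetch one official-report page and return its visible text.
    The 'r=' ids are not sequential, so in practice you sweep a
    range of ids and skip the pages that contain no divisions."""
    resp = requests.get(BASE, params={"r": r, "mode": "html"}, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser").get_text("\n")

# Division results appear in the text as runs of member names under
# 'For', 'Against' and 'Abstentions' headings; since the markup has no
# division-specific classes, a hand-rolled pass over these lines is
# what actually recovers each MSP's vote.
text = fetch_report_text(8256)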

The first thing I’ve done with the data is visualise the MSPs in two dimensions (apologies for the rubbish plot; clicking on it makes it slightly better, and labelling the MSPs and votes would probably be useful):

[Figure: 2D visualisation of MSPs (points) and votes (lines)]

In this 2D world, each MSP is a point (coloured according to standard party convention; grey are independents), and each vote is a line that splits the MSPs who voted ‘for’ the motion from those who voted ‘against’. Given the large number of MSPs and votes, it’s incredibly unlikely that it will be possible to position all of the MSPs and votes such that every MSP is on the correct side of every vote. However, we can use some computational magic to position them in a manner that gets as many right as possible (in this case, 97.8% of the MSP-vote pairs are correct). Once they have been placed, the position of the MSPs tells us something about how they vote. For example, they will be near other MSPs who have similar voting patterns, and far away from those that have very different ones.
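For the curious, here is a minimal sketch of one way to do that positioning (an illustration, not necessarily the exact model I used): each MSP is a point, each vote is a logistic-regression-style line, and gradient descent nudges both until as many MSP-vote pairs as possible land on the right side.

import numpy as np

def embed(V, d=2, iters=2000, lr=0.05, seed=0):
    # V is an (n_msps, n_votes) array: +1 'for', -1 'against', 0 absent.
    # Place MSPs (rows of X) and vote lines (normals W, offsets b) so
    # that sign(X @ W.T + b) matches V wherever a vote was cast.
    rng = np.random.default_rng(seed)
    n, m = V.shape
    X = rng.normal(scale=0.1, size=(n, d))    # MSP positions
    W = rng.normal(scale=0.1, size=(m, d))    # vote-line normals
    b = np.zeros(m)
    cast = V != 0
    for _ in range(iters):
        margin = V * (X @ W.T + b)            # positive = correct side
        g = np.where(cast, -V / (1.0 + np.exp(margin)), 0.0)
        X -= lr * (g @ W) / m                 # logistic-loss gradients
        W -= lr * (g.T @ X) / n
        b -= lr * g.mean(axis=0)
    accuracy = ((X @ W.T + b) * V > 0)[cast].mean()
    return X, W, b, accuracy

The returned accuracy is the fraction of MSP-vote pairs on the correct side, i.e. the 97.8% figure quoted above for the real data.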

I haven’t had the chance to look at this in much detail yet, but the one thing that stands out is the homogeneity of the main parties (their members are very bunched together). In the Westminster parliament, a substantial minority of MPs regularly rebel against their parties, resulting in much broader clouds of points (e.g. here is a plot of Westminster MPs since 2010: 2010plain, and in the 2005 parliament: 2005plain). The most rebellious MSP in the Scottish Parliament is Christine Grahame (SNP) with 6 rebellions (defined as not voting with the party majority) out of 544 votes (~1%). Compare that with Labour MP Dennis Skinner who, up to the end of the 2005 Westminster parliament, had rebelled a total of 273 times (source). In fact, there are so few that I can give you a list of all those MSPs who have rebelled more than once (rebellions, with total votes cast in brackets):

  • Grahame, Christine (SNP, Midlothian South, Tweeddale and Lauderdale) 6 (544)
  • Murray, Elaine (Lab, Dumfriesshire) 5 (529)
  • Gibson, Kenneth (SNP, Cunninghame North) 4 (569)
  • Malik, Hanzala (Lab, Glasgow) 4 (521)
  • Boyack, Sarah (Lab, Lothian) 3 (523)
  • Chisholm, Malcolm (Lab, Edinburgh Northern and Leith) 3 (569)
  • Cunningham, Roseanna (SNP, Perthshire South and Kinross-shire) 3 (536)
  • Henry, Hugh (Lab, Renfrewshire South) 3 (486)
  • Smith, Drew (Lab, Glasgow) 3 (517)
  • Wilson, John (SNP, Central Scotland) 3 (558)
  • McGrigor, Jamie (Con, Highlands and Islands) 3 (535)
  • Dugdale, Kezia (Lab, Lothian) 2 (545)
  • Eadie, Jim (SNP, Edinburgh Southern) 2 (564)
  • Ferguson, Patricia (Lab, Glasgow Maryhill and Springburn) 2 (492)
  • McLeod, Fiona (SNP, Strathkelvin and Bearsden) 2 (530)
  • Park, John (Lab, Mid Scotland and Fife) 2 (287)
  • Urquhart, Jean (SNP, Highlands and Islands) 2 (274)
  • Carlaw, Jackson (Con, West Scotland) 2 (484)
  • Scanlon, Mary (Con, Highlands and Islands) 2 (553)
  • Fraser, Murdo (Con, Mid Scotland and Fife) 2 (511)
  • Milne, Nanette (Con, North East Scotland) 2 (536)

I’ll leave it with you to decide whether that’s a good or bad thing.

(Note 1: my definition of rebellion is not great – in Westminster, a vote is only rebellious if it is whipped (i.e. the party declare that all MPs have to vote a certain way) and not all votes are. Some of those above may have been on un-whipped votes.)
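For what it’s worth, here is a minimal sketch of how the rebellion counts can be computed under that definition (the array layout is my assumption):

import numpy as np

def count_rebellions(V, party):
    """V: (n_msps, n_votes) array, +1 for, -1 against, 0 didn't vote.
    party: length-n_msps array of party labels.
    Returns, per MSP, the number of votes cast against their own
    party's majority position in that division."""
    party = np.asarray(party)
    rebellions = np.zeros(V.shape[0], dtype=int)
    for j in range(V.shape[1]):
        for p in np.unique(party):
            members = (party == p) & (V[:, j] != 0)
            majority = np.sign(V[members, j].sum())
            if majority == 0:        # tied (or empty) party vote: skip
                continue
            rebellions[members & (V[:, j] == -majority)] += 1
    return rebellions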

Two MSPs appear on the plot above twice – John Finnie and Jean Urquhart. They both resigned from the SNP in late October 2012 and so appear once for their votes as members of the SNP and once as independent MSPs. In fact, their independent incarnations can be seen as the two grey circles south-west of the main chunk of SNP MSPs (yellow).

A final observation is how much voting is done. In the Westminster parliament, attendance at votes is highly variable; in Holyrood it seems much more consistent. The median number of votes cast by MSPs is 530 (out of 580; 91%), and the 25th and 75th percentiles are 492 (85%) and 558 votes (96%) respectively.

Note 2: an added complication with the Scottish Parliament is that MSPs can actively abstain from a vote. The model deals with this by defining an exclusion zone around the line, within which only abstainers can be placed. Below are some examples of individual votes with the MSPs now coloured according to how they voted (red = against, green = abstain, blue = for, grey = didn’t vote). The solid line is the vote line; the dashed ones show the exclusion zone. In these examples (as with most), the visualisation has done a decent job of separating the fors and againsts (reds and blues), with, for the most part, only the abstainers and non-voters appearing in the exclusion zone.

[Figures: three example divisions, MSPs coloured by vote, with the vote line (solid) and exclusion zone (dashed) marked]
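In model terms, the colouring rule for a single vote is simple; here’s a sketch (the half-width eps of the exclusion zone is a model parameter, and its value here is an assumption):

import numpy as np

def side_of_vote(X, w, b, eps=0.1):
    """Given MSP positions X and one vote line (w, b), predict
    'for'/'against' outside the exclusion zone, 'abstain' inside."""
    s = X @ w + b
    return np.where(s > eps, "for",
                    np.where(s < -eps, "against", "abstain"))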

Can you solve this probability question?

A colleague asked me the following probability question:

Suppose I have N empty boxes and in each one I can place a 0 or a 1. I randomly choose n boxes in which to place a 1. If I start from the first box, what is the probability that the first 1 I find will be in the mth box?

This is depicted in the following graphic:

[Figure: a row of N boxes, with 1s placed in n randomly chosen boxes]

and we’re interested in the probability that the first 1 (reading from the left) is in, say, the third box.

I’m sure there must be a really nice form to the solution, but I can’t come up with one. The best I can do is the following slightly clunky one – any better ideas?

The probability of the first 1 being in the mth box is equal to the number of configurations where the first 1 is in the mth box, divided by the total number of configurations. The total number of configurations is given by:

\left(\begin{array}{c}N\\n\end{array}\right) = \frac{N!}{n!(N-n)!}

and the number of configurations that have their first 1 in the mth box is equal to the subset of configurations that start with m-1 zeros followed by a 1. This is the same as the number of configurations of (n-1) 1s in (N-m) boxes, i.e. the different ways of building the sequence after the first 1:

\left(\begin{array}{c}N-m\\n-1\end{array}\right) = \frac{(N-m)!}{(n-1)!(N-m-(n-1))!}

So, the probability that the first 1 is in the mth box is:

P(m) = \frac{\left(\begin{array}{c}N-m\\n-1\end{array}\right)}{\left(\begin{array}{c}N\\n\end{array}\right)}
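It’s easy to sanity-check this numerically; here’s a quick Monte Carlo sketch:

from math import comb
import numpy as np

def p_first_one(m, N, n):
    """P(first 1 is in box m) = C(N-m, n-1) / C(N, n)."""
    return comb(N - m, n - 1) / comb(N, n)

rng = np.random.default_rng(0)
N, n, m = 100, 5, 3
trials = 100_000
# Place n ones uniformly at random among N boxes (0-indexed) and
# count how often the earliest one lands in box m.
hits = sum(min(rng.choice(N, size=n, replace=False)) == m - 1
           for _ in range(trials))
print(p_first_one(m, N, n), hits / trials)   # the two should agree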

Here’s what it looks like for a few different values of n, when N=100:

[Figures: P(m) against m for n = 2, n = 5 and n = 20, with N = 100]

These make sense – the more 1s you have, the more likely you are to get one early. So, can anyone produce a neater solution? I’m sure it’ll just involve transforming the problem slightly.

Footnote: because the first 1 has to occur somewhere between the 1st and (N-n+1)th position,
\sum_{m=1}^{N-n+1} P(m) = 1
and therefore:
\left(\begin{array}{c}N\\n\end{array}\right) = \sum_{m=1}^{N-n+1}\left(\begin{array}{c}N-m\\n-1\end{array}\right),
which seemed surprising to me at first, although it turns out to be the hockey-stick identity for binomial coefficients in disguise.

Footnote 2: here’s another way of computing it, still a bit clumsy. The problem is the same as if we had N balls in a bag, n of which are red and (N-n) of which are black. If we start pulling balls out of the bag (and not replacing them), the probability that the first red one appears on the mth draw is equal to the probability of drawing m-1 black ones and then drawing a red one. The first probability can be computed from the hypergeometric distribution as:
\frac{\left(\begin{array}{c}n\\ 0 \end{array}\right) \left(\begin{array}{c}N-n\\ m-1-0 \end{array}\right)}{\left(\begin{array}{c}N\\ m-1 \end{array}\right)}
and the second probability is just equal to the probability of picking one of the n reds from the remaining N-(m-1) balls:
\frac{n}{N-(m-1)}
So the full probability is:
P(m) = \frac{n}{N-(m-1)} \times \frac{\left(\begin{array}{c}n\\ 0 \end{array}\right) \left(\begin{array}{c}N-n\\ m-1-0 \end{array}\right)}{\left(\begin{array}{c}N\\ m-1 \end{array}\right)}
which is arguably messier than the previous one.
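Despite appearances, the two expressions are algebraically identical; here’s a quick numerical check (a sketch):

from math import comb

def p1(m, N, n):
    # Configuration-counting form.
    return comb(N - m, n - 1) / comb(N, n)

def p2(m, N, n):
    # Hypergeometric form.
    return n / (N - (m - 1)) * comb(N - n, m - 1) / comb(N, m - 1)

N, n = 100, 5
assert all(abs(p1(m, N, n) - p2(m, N, n)) < 1e-12
           for m in range(1, N - n + 2))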

Bad Science. Literally.

The following article recently appeared on the website of Science, one of the premier scientific journals:
Who’s Afraid of Peer Review?
The issue was also picked up by the Guardian newspaper:
Hundreds of open access journals accept fake science paper

The articles describe how a hoax paper submitted to several hundred Open Access journals (journals whose articles are available free of charge to anyone) was accepted by a large number of them, sometimes without any peer review. This is obviously bad.

Although the article raises some important issues, it seems wholly unfair to label them as Open Access issues. I’m confident that there are plenty of shifty journals out there (an unfortunate side-effect of the (also unfortunate) increasing pressure to publish more and more), some of which are OA and some of which are not. To make claims about OA, the Science hoax needed to also submit the paper to a large number of non-OA journals and compare the acceptance rates. In fact, it’s very surprising that Science published a study that was so, well, unscientific. Perhaps it’s a double hoax?

Science, by the way, is not OA. If you don’t work at a university that has a subscription, it’ll cost you $150 per year to read research that is mainly funded through public funds. The current OA system isn’t perfect, but it feels like things are moving in the right direction.

If you want more, the hoax has already been picked apart by e.g.: Open access publishing hoax: what Science magazine got wrong, and I’m sure there will be others.

Visualising touch errors

Last year (I’ve been meaning to write this for a while), Daniel Buschek spent a few months in the IDI group as an intern. His work here resulted in an interesting study looking at the variability of touch performance on mobile phones between users and between devices. Basically, we were interested in seeing if an algorithm trained to improve a user’s touch performance on one phone could be translated to another phone. To find the answer, you’ll have to read the paper…

During the data collection phase of the project, Daniel produced some little smartphone apps for visualising touch performance. For example, on a touch-screen smartphone navigate to:

http://dcs.gla.ac.uk/taps/demos/tapping2/

Hold your phone in one hand and repeatedly touch the cross-hair targets with the thumb of the hand you’re holding the phone with. Once you’ve touched 50 targets the screen will change. If you’re going to do it, do it now before you read on!


Ok, so now you’ll see something that looks a bit like this:
[Photo: the resulting visualisation, right-thumb traces]

This is a visualisation of touch accuracy (or the lack of it): the lines moving to the right (from green to red) demonstrate that I typically touch some distance to the right of where I’m aiming (this is common when holding the phone in the right hand and using the right thumb).

Here’s how it works: at the start, the app chooses 5 targets (the green points). These are presented to you in a random order. When you touch, the app records where you touched and replaces the original target location with the position where you actually touched. i.e. there are still five targets, but one of them has moved. Once you’ve seen (and moved) all five targets, it loops through them again showing you the new targets and again replacing with the new touch. Because most people have a fairly consistent error between where they are aiming and where they actually touch, plotting all of the targets results in a gradual drift in one direction, finishing (after 50 touches, 10 for each trace) with the red points. Note that not all people have this consistency (something discussed in Daniel’s paper).
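To make that loop concrete, here’s a little simulation sketch of it (the true_touch function, standing in for a real thumb, is an assumption):

import random

def tapping_session(true_touch, n_targets=5, n_rounds=10):
    """Simulate the app's target-replacement loop: each round shows
    the current targets in a random order, and every touch becomes
    the new location of the target it replaced."""
    traces = [[(random.random(), random.random())]
              for _ in range(n_targets)]
    for _ in range(n_rounds):
        for i in random.sample(range(n_targets), n_targets):
            aimed_at = traces[i][-1]
            traces[i].append(true_touch(*aimed_at))
    return traces   # 5 traces of 10 touches each = 50 touches

# A consistent rightward offset plus a little noise, roughly what a
# right thumb produces; plotting the traces gives the drift above.
rightward = lambda x, y: (x + 0.02 + random.gauss(0, 0.005),
                          y + random.gauss(0, 0.005))
traces = tapping_session(rightward)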

To demonstrate how the errors vary, here’s another one I did, but this time with the left hand, where the offset is now in a completely different direction:

[Photo: the left-hand traces]

One particularly odd feature (for me anyway) is that I know that I’m touching with errors, but don’t seem to be able to correct them! The fact that the errors are so different for left-hand and right-hand use points to a problem in designing methods to correct for errors – we need to know which hand the user is using.

If you’d like to try one target rather than 5, use this link:
http://daniel-buschek.de/demos/tapping2/

If you’re interested in why there is an offset, then this paper by Christian Holz and Patrick Baudisch is a good starting point.

Running…

In the Machine Learning course I teach (designed with Mark Girolami), I use Olympic 100m winning times to introduce the idea of making predictions from data (via linear regression). Here’s the data, with the linear regression lines projected into the future. Black is men, red is women (dashed lines represent the plus and minus three standard deviation region):
[Figure: men’s and women’s 100m winning times with regression lines projected forward]
There are two reasons for using this data. The first is that it’s nice and visual and something students can relate to. The second is that it opens up a discussion about some of the bad things you can do when building predictive models. In particular, what happens if we project further into the future? Here’s the same plot, extended to 2200:
[Figure: the same plot extended to 2200, showing the two lines crossing]
Because women have historically been getting faster more quickly than men, the two lines cross at the 2156 Olympics. At about this point in the lectures, a vocal student will pipe up and point out that assuming the trend continues this far into the future looks a bit iffy. And if we assume a linear trend indefinitely, we eventually get to the day where the 100m is won in a negative number of seconds:
[Figure: the linear trends extended until the predicted winning times go negative]
Hopefully you’ll agree that we should be fairly sceptical of any analysis that says this could happen. So I was quite surprised to discover when reading David Epstein’s The Sports Gene that academics had done exactly this. And published it in Nature. NATURE! Arguably the top scientific journal IN THE WORLD. If you don’t believe me, here it is:
Momentous sprint at the 2156 Olympics.
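For completeness, the crossing year is just where the two fitted lines intersect; here’s a sketch (np.polyfit on the real gold-medal times; the argument arrays are assumed inputs):

import numpy as np

def crossing_year(years_m, times_m, years_w, times_w):
    """Fit a straight line to each series and return the year at
    which the two fitted lines intersect."""
    slope_m, icept_m = np.polyfit(years_m, times_m, 1)
    slope_w, icept_w = np.polyfit(years_w, times_w, 1)
    return (icept_m - icept_w) / (slope_w - slope_m)

# Fed with the men's and women's Olympic 100m winning times, this
# should land at around 2156, as in the Nature paper above.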

Rant: Politicians and YouGov polls

Louise Mensch is the Conservative MP for Corby.  Today she tweeted this:

RT @RicHolden: #YouGov polling numbers for female voters: #Conservatives 43%; #Labour 41%; LD 9%; Oth 7%. #feminism #VoteConservative

There is not a lot of difference between 41% and 43%.  The polling companies can’t feasibly ask every woman in the UK.  If you only ask a subset then small differences might just be because you happened to ask an unrepresentative sample.

Here is the actual poll data:

http://cdn.yougov.com/cumulus_uploads/document/rgjtpnh1ug/YG-Archives-Pol-Sun-results-090112.pdf

which says that they asked 1727 people. That’s a reasonable number of people. If you are statistically minded, you could work out whether 1727 people (fewer still, once you restrict attention to women) are enough to support any conclusions about a 2% difference.
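For the statistically minded, the standard back-of-envelope margin of error makes the point (a sketch):

from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n."""
    return z * sqrt(p * (1 - p) / n)

# Even using the full 1727 respondents (the female subsample is
# smaller, so the true margin is wider):
print(margin_of_error(0.43, 1727))   # ~0.023, i.e. +/- 2.3 points

A two-point gap sits comfortably inside that margin.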

If you’re not statistically minded (or are, but are lazy), you can instead look at the same survey done just 3 days earlier:

http://cdn.yougov.com/cumulus_uploads/document/d51j2t3jzl/YG-Archives-Pol-ST-results-06-080112.pdf

and discover that three days earlier it was Lab 42%, Con 37%!

I can think of two reasons for this:

1: Women have suddenly (IN THE LAST 3 DAYS) started a mass Labour exodus (Ed Miliband has been pretty useless).

2: Polling a small number of people isn’t very reliable and making a big deal about it when it favours your particular cause makes you look a bit silly.

I’m inclined to go for the latter, although it would be nice to think that female support for UKIP is a fifth of what it was 3 days ago.  And that the 1% of males who would have voted for the BNP on the 6th Jan have since been culled/educated.

Two interesting features do stand out from the data:

1. Women are consistently more likely to give the answer ‘don’t know’ than men. I don’t believe that women really ‘don’t know’ more (or that men ‘know’ more), so I did a quick search to see if this is something that has been studied. I could only find this (didn’t look for long), which looks at it for the quite specific question “do you snore” (!!): Correlates of the “don’t know” response to questions about snoring.

2. People become much less likely to say ‘don’t know’ as they get older. Actually, I know some old people and this isn’t surprising.

Visualising parliamentary data…again

In a previous post, I described a very general binary PCA algorithm for visualising voting data from the House of Commons.

To recap, we start with voting data which tells us how each of the 600 or so MPs voted in each of the 1400 or so votes of each parliament. An MP can vote for a particular motion or against it, or may not vote at all. The algorithm converts this into a two-dimensional plot where each MP is represented by a point. MPs close together have similar voting patterns, and those far apart don’t. See this for a better description.

I’ve now extended the algorithm a bit – no time to go into details now – but the outcome is that it produces a visualisation that incorporates a degree of uncertainty in the location of the MPs. This uncertainty can come from one of two sources:
1. Lack of data: if MPs don’t vote very much, we can’t be sure about where they should be plotted (a nice example is the three deputy speakers in 2010plain – big ellipses near the centre).
2. Lack of conformity: it might be hard to place some MPs in two dimensions in such a way that most of their votes are modelled correctly.

The following plots show the results for the 1997, 2001, 2005 and 2010 (up to about May 2010) parliaments. Each ellipse is an MP, and they’ve been coloured according to their party (main parties are obvious, key at the bottom for the others). The ellipses roughly represent where the MP might lie — the bigger the ellipse, the less sure we are about the location.
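For anyone wanting to draw something similar, here’s a sketch of building one such ellipse from a 2D location mean and covariance (matplotlib; the two-standard-deviation radius is my assumption):

import numpy as np
from matplotlib.patches import Ellipse

def mp_ellipse(mean, cov, colour):
    """Build a 2-sd ellipse patch for one MP's uncertain position."""
    vals, vecs = np.linalg.eigh(cov)                 # principal axes
    angle = np.degrees(np.arctan2(vecs[1, 1], vecs[0, 1]))
    width, height = 4 * np.sqrt(vals[1]), 4 * np.sqrt(vals[0])
    return Ellipse(mean, width, height, angle=angle,
                   facecolor=colour, alpha=0.4)

# Usage: ax.add_patch(mp_ellipse((0, 0), np.eye(2) * 0.01, 'red'))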

The lines on the plots represent the votes. MPs on one side of the line voted for the motion and on the other against. It would be nice to label some of the votes, maybe I’ll do this soon.

Anyway, here are the files. They are a bit messy, and I’ve labelled some of the MPs who I thought might be interesting. Happy to label any others if anyone is interested.

1997plain
2001plain
2005plain
2010plain

Some things that I think are interesting:
1. Clare Short before and after (Clare Short2) resigning (2005plain).
2. Ken Livingstone before and after (Ken Livingstone 2) resigning (1997plain).
3. How much the nationalist parties that were previously close to the Conservatives and Lib Dems (e.g. DUP and SNP) have now deserted them (compare everything with 2010plain).
4. How close Clegg and Cameron are (2010plain) (couldn’t find a font small enough to separate them)
5. There appear to be more rebellious Conservatives in the coalition (2010plain) than Lib Dems (more lines (votes) cut through the blue ellipses than the yellow ones).
6. In 1997, the Lib Dems seemed to vote almost equally with Labour and with the Conservatives (roughly the same number of lines splitting Lib&Lab from Con as splitting Lib&Con from Lab). In 2001 and 2005, the Lib Dems seem more aligned with the Conservatives than Labour (more lines splitting Lab from Lib&Con than splitting Lib&Lab from Con).

Key:
Red: Labour
Blue: Conservative
Yellow: Lib Dem
Magenta: DUP
Orange: Plaid Cymru
Weird pinky-orange colour: Scottish National Party (normally found next to Plaid Cymru)
Green: All the rest