Was Murray unlucky?

Andy Murray played Rafael Nadal today in the semi-final of the ATP World Tour Finals at the O2 arena in London. Nadal won a great match, 7-6 (7-5), 3-6, 7-6.

Here are the match stats (I couldn't work out how to generate a permanent link to this):

Murray won more points overall (5 more than Nadal), but still lost the match.

The other numbers near the bottom of the table tell us how many points each player won when serving and when receiving. Murray served 114 points of which he won 78; Nadal served 109 of which he won 73.

I’ve been considering building a statistical model for a tennis match for a while. A model is a mathematical representation of something in the world (e.g. a tennis match). With a good model, we can learn something about the match that perhaps isn’t immediately obvious and maybe predict what might happen in the future.

The number of points won when serving/receiving could form the basis of a simple model of a tennis match. Murray won 68.4% of the points when he served, and 33.0% of the points when Nadal was serving. Using this information, and a knowledge of the tennis scoring system, it is possible to generate potential matches: for each point, we flip one of two dodgy coins (depending on who is serving) and if it lands heads, we award the point to Murray. The first coin should be designed to land heads 68.4% of the time, the second 33.0% of the time. This would be time-consuming, but fortunately, it is the kind of thing that is very fast on a computer.
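The post doesn't include the simulation code, so here is a minimal Python sketch of the idea (the function names and the handling of who serves first in each set are my own simplifications; the probabilities are the ones quoted above):

import random

def play_game(p_murray):
    # One game, from Murray's point of view: first to 4 points, 2 clear.
    # p_murray is Murray's chance of winning each point in this game.
    m = n = 0
    while not ((m >= 4 or n >= 4) and abs(m - n) >= 2):
        if random.random() < p_murray:
            m += 1
        else:
            n += 1
    return m > n, m + n

def play_tiebreak(p_serve, p_recv, murray_serves_first):
    # Tiebreak: first to 7 points, 2 clear; the server changes after the
    # first point and then after every two points.
    m = n = 0
    murray_serving = murray_serves_first
    while not ((m >= 7 or n >= 7) and abs(m - n) >= 2):
        p = p_serve if murray_serving else p_recv
        if random.random() < p:
            m += 1
        else:
            n += 1
        if (m + n) % 2 == 1:
            murray_serving = not murray_serving
    return m > n, m + n

def play_set(p_serve, p_recv, murray_serves_first):
    # One set, with a tiebreak at 6-6. Returns (murray_won, points_played).
    m = n = points = 0
    murray_serving = murray_serves_first
    while True:
        if m == 6 and n == 6:
            won, pts = play_tiebreak(p_serve, p_recv, murray_serving)
            return won, points + pts
        won_game, pts = play_game(p_serve if murray_serving else p_recv)
        points += pts
        if won_game:
            m += 1
        else:
            n += 1
        murray_serving = not murray_serving
        if (m >= 6 or n >= 6) and abs(m - n) >= 2:
            return m > n, points

def play_match(p_serve=0.684, p_recv=0.330):
    # Best of three sets. Who serves first in each set is a simplification:
    # under the independence assumption it makes almost no difference.
    m = n = points = 0
    serves_first = True
    while m < 2 and n < 2:
        won, pts = play_set(p_serve, p_recv, serves_first)
        points += pts
        serves_first = not serves_first
        if won:
            m += 1
        else:
            n += 1
    return m > n, points

results = [play_match() for _ in range(10000)]
wins = sum(won for won, _ in results)
long_matches = sum(pts >= 223 for _, pts in results)
print(wins / 100, long_matches / 100)  # % of wins, % of matches with 223+ points

Running it reproduces the flavour of the numbers below, though the exact percentages will wobble from run to run.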

I’ve generated 10,000 matches in this way, of which Murray won 5753 (57.5%). In other words, if this model is reasonably realistic (more on this later), then we would expect Murray to win more often than lose if they played at the same level (served and received as well as they did today) in the future.

We can look a bit deeper into the results and work out how likely different scorelines are:

Murray 2 – 0 Nadal: 30.3%
Murray 2 – 1 Nadal: 27.2%
Murray 1 – 2 Nadal: 22.3%
Murray 0 – 2 Nadal: 20.1%

Another interesting stat is how many points we'd expect to see. The following plot shows the number of points in each of the 10,000 simulated matches:

The red line shows the number in the actual match: 223. Only 1.6% of the simulated matches involved 223 or more points. So, if we’re happy that our model is realistic, the match was surprisingly long.

How realistic is the model? Not very. It's based on the assumption that all points are independent of one another: the outcome of a particular point doesn't depend on what's already happened. This is unrealistic – players are affected by previous points. It also assumes that the chance of, say, Murray winning a point when he is serving is constant throughout the match. This is also unrealistic – players get tired. These caveats don't necessarily mean that the model is useless, but they should feature prominently in any conclusions.

One way to make it more realistic would be to use the stats from the three individual sets when performing the simulation (a sketch of this variant follows the stats). The stats for the three sets (from Murray's point of view) are:
Set 1: Serving 28/36 (77.8%), Receiving 7/36 (19.4%)
Set 2: Serving 18/25 (72.0%), Receiving 14/30 (46.7%)
Set 3: Serving 32/53 (60.4%), Receiving 15/43 (34.9%)
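Reusing play_set from the sketch above, this per-set variant is a small change (again my own sketch, not the post's code): set k of each simulated match uses the serve/receive percentages observed in real set k.

# Per-set (serve, receive) point-win probabilities for Murray, from the table above.
SET_PROBS = [(28/36, 7/36), (18/25, 14/30), (32/53, 15/43)]

def play_match_by_set(set_probs=SET_PROBS):
    # Best of three sets, where simulated set k uses the stats from real set k.
    m = n = 0
    serves_first = True
    for p_serve, p_recv in set_probs:
        won, _ = play_set(p_serve, p_recv, serves_first)
        serves_first = not serves_first
        if won:
            m += 1
        else:
            n += 1
        if m == 2 or n == 2:
            break
    return m > n

wins = sum(play_match_by_set() for _ in range(10000))
print(wins / 100)  # percentage of simulated matches won by Murray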

Simulating 10,000 results with this enhanced model results in Murray winning 5932 matches (59.3%). This is slightly higher than the previous result but we’d need to simulate more matches to be sure it’s not just a bit higher by chance.

Here is the breakdown of the possible match scores:

Murray 2 – 0 Nadal: 40.2%
Murray 2 – 1 Nadal: 19.1%
Murray 1 – 2 Nadal: 37.5%
Murray 0 – 2 Nadal: 3.2%

This is very different to the previous breakdown. Scores of 2-0 and 1-2 have both become more likely and a Nadal 2-0 win looks very unlikely. We can see why if we look more closely at who wins each set. The first set is pretty close: Murray wins it 42.6% of the time. The second set is incredibly one-sided with Murray winning it 94.5% of the time. If the match goes to a third set, Murray wins it 34.1% of the time.

Interestingly, if we look at the number of points again, it's now even less likely to see 223 or more points: only 54 of the simulated matches (0.5%) were this long or longer. I didn't expect this – my gut feeling was that the matches in the first model were on the whole shorter than the real one because the model wasn't much good. However, this new model is more realistic, and the simulated matches have got shorter!

One possible conclusion to the whole analysis is that the scoreline was a bit flattering to Murray – he nearly won the third set although he didn't play particularly well in it (he could expect to win it only 34.1% of the time). To illustrate how much better Nadal played in the final set, we can simulate matches using just the stats from the last set. This results in a very one-sided picture, with Nadal winning 7132 of the 10,000 matches (71.3%)!

Obviously the earlier criticisms of the model still apply – we’re now assuming that within a set, each player has a constant probability of winning points. It’s easy to argue that this is still insufficient. However, it certainly provides some food for thought.

Bayesian model selection for beginners

I've recently encountered the following problem (or at least, a problem that can be formulated like this; the actual thing didn't involve coins):

Imagine you’ve been provided with the outcomes of two sets of coin tosses.

First set: N_1 tosses, n_1 heads

Second set: N_2 tosses, n_2 heads

And we want to know whether the same coin was used in both cases.  Note that these ‘coins’ might be very biased (always heads, always tails), we don’t know.

There are classical tests for doing this, but I wanted to do a Bayesian test.  This in itself is not much of a challenge (it’s been done many times before) but what I think is interesting is how to explain the procedure to someone who has no familiarity with the Bayesian approach.  Maybe if that is possible, more people would do it.  And that would be a good thing.  Here’s what I’ve come up with:

We'll start with a slightly simplified version of the problem. We'll assume that there are 9 coins in a bag, and that the person who did the coin tossing either:

Pulled one coin out of the bag and did N_1 tosses followed by N_2 tosses

OR

Pulled one coin out of the bag, did N_1 tosses, put the coin back, pulled out another coin (could have been the same one) and did N_2 tosses with this one.

The coins in the bag all have different probabilities of landing heads.  The probabilities for the 9 coins are: 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9.  We’ll label these coins c_1 to c_9.

Model 1

We’ll start by building a model for each scenario.  The first model assumes that both sets of coin toss data were produced by the same coin.  If we knew which coin it was, the probability of seeing the data we’ve seen would be:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

i.e. the probability of seeing n_1 heads in the first N_1 tosses of coin c_n multiplied by the probability of seeing n_2 heads in the second N_2 tosses of the same coin c_n.

In reality we don't know which coin it was. In this case, it seems sensible to calculate the average probability: the total of all of the probabilities divided by the number of coins. This is computed as:

p_1=\sum_{n=1}^9 \frac{1}{9} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

Another way of thinking about this is as a weighted sum of the probabilities where each weight is the probability of that coin being chosen. In this case, each coin was equally likely and so has probability 1/9. In general this doesn’t have to be the case.

Note that we haven't defined what Pr(n_1|N_1,c_n) is. This isn't particularly important – it's just a number that tells us how likely we are to get a particular number of heads with a particular coin (for the record, it's the binomial probability \binom{N_1}{n_1}c_n^{n_1}(1-c_n)^{N_1-n_1}).

Model 2
The second model corresponds to picking a coin randomly for each set of coin tosses. The probability of the data in this model, if we knew that the coins were c_n and c_m, is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

Again, we don’t know c_n or c_m so we average over them both. In other words, we average over every combination of coin pairs. There are 81 such pairs and each pair is equally likely:

p_2 = \sum_{n=1}^9 \sum_{m=1}^9 \frac{1}{81} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)
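Both evidences are easy to compute directly. Here's a minimal Python sketch (my own, not from the post), using the binomial distribution for Pr(n|N,c):

from math import comb

def pr(n, N, c):
    # Binomial probability of n heads in N tosses of a coin that lands
    # heads with probability c.
    return comb(N, n) * c**n * (1 - c)**(N - n)

COINS = [i / 10 for i in range(1, 10)]  # c_1 ... c_9: 0.1, 0.2, ..., 0.9

def p1(n1, N1, n2, N2):
    # Model 1: the same coin produced both sets of tosses.
    return sum(pr(n1, N1, c) * pr(n2, N2, c) for c in COINS) / len(COINS)

def p2(n1, N1, n2, N2):
    # Model 2: a coin was drawn independently for each set of tosses.
    return sum(pr(n1, N1, cn) * pr(n2, N2, cm)
               for cn in COINS for cm in COINS) / len(COINS) ** 2

print(p1(9, 10, 1, 10), p2(9, 10, 1, 10))  # example 1 below: ~0.000029 vs ~0.008478
print(p1(6, 10, 5, 10), p2(6, 10, 5, 10))  # example 2 below: ~0.01667 vs ~0.01020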

Comparing the models

We can now compare the numbers p_1 and p_2. If one is much higher than the other, that model is the more likely one. Let's try an example…

Example 1
Here’s the data:

N_1=10,n_1=9,N_2=10,n_2=1

So, 9 heads in the first set of tosses and only 1 in the second. At first glance, it looks like a different coin was used in each case. Let's see if our calculations agree. Firstly, here is Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n) plotted on a graph where each bar is a different c_n.

The coin that looks most likely is the one with c_n=0.5. But, look at the scale of the y-axis. All of the values are very low – none of them look like they could have created this data. It’s hard to imagine taking any coin, throwing 9 heads in the first ten and then 1 head in the second ten.

To compute p_1 we add up the total height of all of the bars and divide the whole thing by 9: p_1=0.000029.

We can make a similar plot for model 2:

The plot is now 3D, because we need to look at each combination of c_n and c_m. The height of each bar is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

The bars are only high when c_n is high and c_m is low. But look at the height of the high bars – much, much higher than those in the previous plot. Some combinations of two coins could conceivably have generated this data. To compute p_2 we need to add up the height of all the bars and divide the whole thing by 81: p_2 = 0.008478.

This is a small number, but almost 300 times bigger than p_1. The model selection procedure has backed up our initial hunch.

Example 2

N_1=10,n_1=6,N_2=10,n_2=5

i.e. 6 heads in the first 10, 5 heads in the second 10. This time it looks a bit like it might be the same coin. Here’s the plot for model 1:

And here’s the magic number: p_1 = 0.01667

Model 2:

With the number: p_2 = 0.01020.

This time, the comparison slightly favours model 1.

Why only slightly? Well, the data could still have come from two coins – maybe two similar coins, or maybe the same coin selected twice. For this reason, it is always going to be easier to conclude that the coins are different than that they are the same.

Why would it ever favour model 1?

Given that we could always explain the data with two coins, why will the selection procedure ever favour one coin? The answer to this is one of the great benefits of the Bayesian paradigm. Within the Bayesian framework, there is an inbuilt notion of economy: a penalty associated with unnecessarily complex models.

We can see this very clearly here – look at the two plots for the second example: the heights of the highest bars are roughly the same, i.e. the best single coin performs about as well as the best pair of coins. But the average single coin is better than the average pair of coins. This is because the total number of possibilities (9 -> 81) has increased faster than the number of good possibilities. The average goodness has decreased.

Model 2 is more complex than model 1 – it has more things that we can change (2 coins rather than 1). In the Bayesian world, this increase in complexity has to be justified by much better explanatory power. In example 2, this is not the case.

In example 1 the same penalty applies, but it is overwhelmed: model 1 had no good possibilities (no single coin could have realistically generated the data) whereas model 2 had some. Hence, model 2 was overwhelmingly more likely than model 1.

Summary

This is a simple example. It also doesn’t explain all of the important features of Bayesian model selection. However, I think it nicely shows the penalty that Bayesian model selection naturally imposes on overly complex models. There are other benefits to this approach that I haven’t discussed. For example, this doesn’t show the way in which p_1 and p_2 change as more data becomes available.

Relating back to Bayes rule

The final piece of the jigsaw is relating what we’ve done above to the equations that might accompany other examples of Bayesian model selection.  Bayes rule tells us that for model 1:

Pr(c_n|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)Pr(c_n)}{\sum_{o=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_o)Pr(c_o)}

and for model 2:

Pr(c_n,c_m|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)Pr(c_n)Pr(c_m)}{\sum_{o=1}^9 \sum_{q=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_q)Pr(c_o)Pr(c_q)}

In both cases, we computed the denominator of the fraction – known as the evidence. In general, it may consist of an integral rather than a sum but the idea is the same: averaging over all parameters (coins or combinations of coins) to get a single measure of goodness for the model.
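To connect the sum and integral versions, here is a quick numerical sketch (my own illustration, assuming scipy is available) of the evidence for example 1 when the 9 discrete coins are replaced by a uniform prior over the heads-probability c:

from math import comb
from scipy.integrate import quad

def pr(n, N, c):
    return comb(N, n) * c**n * (1 - c)**(N - n)

# One shared coin: integrate the product of the two likelihoods over c.
p1_cont, _ = quad(lambda c: pr(9, 10, c) * pr(1, 10, c), 0, 1)
# Two independent coins: integrate each likelihood separately and multiply.
p2_cont = (quad(lambda c: pr(9, 10, c), 0, 1)[0]
           * quad(lambda c: pr(1, 10, c), 0, 1)[0])
print(p1_cont, p2_cont)  # ~2.6e-05 vs ~8.3e-03: the same conclusion as before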

How complex is the parliament?

I’ve finally got around to doing some analysis on the parliamentary vote data using a model that can handle both binary data (the votes) and missing data (MPs not voting).

Intuitively, it’s quite easy to imagine how it works:

Assume we have a load of MPs and a single vote in which they all took part. They can only vote yes or no (1 or 0). It is clearly possible to get all of the MPs to stand in a line such that everyone to the left of some point voted no (0) and everyone to the right voted yes (1). We could then characterise each MP by their position in the line and from that work out how they voted. We'd end up with something that looks like this (each digit is an MP):

000000000111111111

Seem pointless? Yep, it would be. But what if there were two votes? If the MPs voted the same in both votes, it would be easy – we’d still only need one line, and one reference point. It would look like this (each MP is now a column, each row a vote):

V1: 000000011111111
V2: 000000011111111

If they voted completely oppositely, it would still be possible:

V1: 000000011111111
V2: 111111100000000

We can still use a single line if they vote a bit differently in two votes but two reference points will be required:

V1: 00000000111111
V2: 00001111111111

For vote 2, the reference point is slightly to the left of that for vote 1.

It’s just as easy to dream up voting patterns for which we can’t do this. For example, we can’t reorder the following MPs such that for each vote there is a point to the left of which they all vote 0 and to the right 1:

V1: 00000001111111
V2: 00011110000011

What we can do is position the MPs in 2 dimensions – imagine placing them in a room rather than on a line. For each vote, we'll split the room in two with a straight line such that all MPs on one side of the line vote one way, and all on the other side vote the other way.

For two votes, we could always position the MPs in two dimensions in this way (much like we could with one vote and a line). We might be able to do 3 votes in 2 dimensions, but we can't be sure – it depends on how they voted.

Given data for a full parliament (about 1600 votes), it’s interesting to see how well we can do this with a particular number of dimensions. For example, if we restrict ourselves to two dimensions, is it possible to lay the MPs out and draw a line for each vote such that each MP is on the correct side of each line? If it is not possible, what’s the highest number of MP-vote pairs that we can get right?
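The post doesn't say how such a layout is found, so here is one plausible sketch (entirely my own, using a logistic-loss matrix factorisation; the author may well have used something else). Each MP gets a position in d dimensions and each vote gets a separating hyperplane, and both are adjusted by gradient descent so that as many MP-vote pairs as possible land on the correct side:

import numpy as np

def fit_positions(V, d=2, steps=2000, lr=0.1, seed=0):
    # V: (n_mps, n_votes) array with entries +1/-1 and np.nan where the MP
    # didn't vote. Returns MP positions X, one hyperplane (W, b) per vote,
    # and the fraction of observed MP-vote pairs classified correctly.
    rng = np.random.default_rng(seed)
    n, v = V.shape
    X = rng.normal(scale=0.1, size=(n, d))   # MP positions
    W = rng.normal(scale=0.1, size=(v, d))   # one dividing line per vote
    b = np.zeros(v)
    mask = ~np.isnan(V)
    Y = np.where(mask, V, 0.0)               # missing pairs contribute nothing
    for _ in range(steps):
        Z = X @ W.T + b                      # predicted margins
        G = -Y / (1.0 + np.exp(Y * Z))       # logistic-loss gradient; zero at missing pairs
        X -= lr * (G @ W) / v
        W -= lr * (G.T @ X) / n
        b -= lr * G.mean(axis=0)
    correct = ((np.sign(X @ W.T + b) == V) & mask).sum()
    return X, W, b, correct / mask.sum()

Sweeping d from 1 upwards and recording the accuracy produces the kind of curves plotted later in the post.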

This might tell us something about the voting patterns of the parliament. For example, in the UK we have a 3-party system (or at least we had until recently). Let's assume that for each vote in, say, the 1997 parliament, Labour MPs voted one way, Conservative MPs the other, and the Lib Dems sided with one of the other two. If this were the case, we'd only need one dimension:

V1: 0000 1111 1111
V2: 0000 0000 1111
V3: 1111 1111 0000
V4: 1111 0000 0000

Each column is an MP, each block (set of four columns) a party (Labour, Lib Dem, Conservative, in that order). If we find for real data that we need only one dimension, it suggests this (or something similar) is happening.

Alternatively, if we assume that sometimes Labour and Conservatives vote together and the Lib Dems differently, we would need a second dimension.

The following plot shows the percentage of MP-vote pairs we can get right for different parliaments as we increase the number of dimensions:

(The line for 2010 should be treated with some caution – only 100 votes or so so far.)

To put the y-axis into perspective, in a parliament of 1600 votes and 600 MPs, an increase of 1% corresponds to getting approximately 9000 additional MP-vote pairs correct – about 10 MPs' worth if MPs vote about 50% of the time.

The results suggest two things to me. Firstly, the voting patterns in each parliament are pretty simple – we can get a lot of the votes right with two dimensions. This is not surprising – most MPs will vote along party lines and we have three (main) parties. Secondly, it looks like the three successive Labour parliaments have been getting slightly more complex over time – 1997 seems to have the simplest structure, but not by much. Perhaps over time MPs became a bit more rebellious?

PCA, missing values and variance…

In a previous post, I compared the results of analysing data from various parliaments using classical PCA and Probabilistic PCA.  The point of this was to show the effect that naively handling missing values – assuming they had a value of 0 – had on the PCA analysis: It looked like MPs from the same parties were more different from one another than perhaps they really are – all those zeros were having too much of a say on the outcome.

Since then, I’ve been playing with this a little more, going in two connected directions.  Firstly, the data is binary and PCA/PPCA are both built to handle real values.  It would be better to use a PCA-like approach that is specifically designed for binary data.  Some exist in the literature; I’ve also made a new(ish) one myself.

Secondly, I’ve become interested in how best to handle missing values in the PPCA models. The PPCA approach I used previously simultaneously found the components and filled in the missing data as it (the model) would have expected to see it.  On the face of it, this seems reasonable.  An alternative is to explicitly build into the model the fact that not every MP-vote pair exists.  In this case, the model knows not to expect each MP to vote each time and only takes into account the pairs that do exist.

The difference between the two may seem subtle, but it makes quite a significant difference to the outcome.  In particular, both methods allow us to not just see whereabouts an MP is in the reduced space (where the dot is in the two-dimensional plots), but to get some idea of the variance of this position: how certain the model is of where the MP should be.  I would hope that the more often an MP voted, the more certain we’d be about her position.  The method that only takes into account the data that exists behaves in this way, the method that guesses the missing votes doesn’t.

Why? Well, when the method comes to working out the position of the MP, its certainty will depend on (amongst other things) how much data there is. In the case of the method that guesses, it sees its own predictions as 'data' and becomes much more confident (overly so). Not only does it see more data (or what it thinks is data), the additional stuff it sees conforms exactly to what it expects to see (because it produced it itself). This results in an endless confidence-increasing (and hence uncertainty-decreasing) spiral.
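A toy illustration of the effect (my own, using a simple Gaussian mean estimate rather than PPCA): if imputed values are treated on a par with real observations, the reported uncertainty shrinks with the number of imputations rather than with the amount of actual data.

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
observed = rng.normal(0.0, sigma, size=20)   # an MP who turned up 20 times
n_missing = 180                              # ... and missed 180 divisions

# Honest version: uncertainty in the mean based on the 20 real observations.
var_honest = sigma**2 / len(observed)

# Impute-and-pretend version: fill the gaps with the current estimate and
# then treat the filled values as if they were real data.
padded = np.concatenate([observed, np.full(n_missing, observed.mean())])
var_overconfident = sigma**2 / len(padded)   # 10x smaller, for free

print(var_honest, var_overconfident)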

Of course, this is just one model, and I'm sure that other models exist that impute missing values as they optimise without creating such a bias. It's also important to remember that this problem comes about here because the guessing of the missing values and the determining of the MP positions are done simultaneously. We can always impute the missing values at the end, once the model is happy with its predictions of the MPs' positions.

However, it opens up an interesting question as to whether we should ever be imputing missing values as we go along in models such as this when we have a model that doesn’t make it necessary.  People seem to do it, and I guess they do it because it seems like they’re getting something for nothing.  But, the amount of information at our disposal is determined by the data that we can see.  We can fill in gaps, but doing so doesn’t give us more information.  The problem with the PPCA model that I used is that it was filling in the numbers and then assuming that these values were to be treated on a par with the actual data.

The practical implication is that the results I showed probably gave the impression that the parties are more coherent than they actually are.  I’m going to redo this with the more sensible model soon.

Footnote: Interestingly, a recent JMLR paper tackles the problem of missing values in PCA in (a lot of) detail: ilin10a.pdf

PCA..again

Finally had a moment to do the other parliaments for which I have data.  Here they are.  In each case the first plot is classic PCA and the second probabilistic:

1997

The second axis is flipped in the PPCA plot – the sign of each component is arbitrary, so this happens in PCA – but it's kind of interesting how there seems to be a selection of parties (PC, SDLP, SNP) on a line between Labour and the Lib Dems, possibly representing a kind of political spectrum? There also seems to be a huge wasteland splitting the Tories from everyone else.

2001

Labour appear a lot less coherent than in 1997 (just comparing the probabilistic plots, I don’t really trust the spread in the classic ones because of the missing value problem).  Also, the line of small parties that seemed to link Lab and Lib Dem in 1997 has been pulled towards the Tories.  (Again, ignore the 2nd axis flip).

2010 (only 49 divisions so far)

So, not much data yet – only 49 divisions (votes) – but the effect of the coalition is clear and it looks like interesting groups may be developing between the Tories and Lib Dems (we would need more data to be sure of this). The coalition seems to have scared off the DUP (especially), PC and SNP, who have all dramatically migrated towards Labour.

What next?

One criticism of this analysis is that PCA and PPCA both assume that the data are real-valued, which is not the case here (each MP-vote pair is either -1, 1 or missing). I've been pointed in the direction of PCA-like techniques for binary data and will post some results at some point if I can get it to go a bit faster (there's quite a lot of data!).

Politicians, visualisation, and missing values

British MPs do a lot of voting.  In an average parliament (if there is such a thing), there can be upwards of 1500 separate votes, also known as divisions.  The Public Whip is a brilliant site, dedicated to providing this data in an easy to use format.

The Public Whip also has a couple of nice little visualisations of the data, in particular the one here. This uses a mathematical technique known as multi-dimensional scaling (MDS) to convert the specific voting pattern for each MP into two numbers that can then be plotted onto a map. In other words, it lets us visualise the 1500 or so votes in a nice 2-dimensional plot.

To do this, one has to convert the votes into numbers. Let's say, for argument's sake, that we use the number '1' if the MP voted for a particular motion, '-1' if they voted against it and '0' for everything else (abstention, laziness, etc). MDS isn't the only technique that can be used for visualisation like this. Chris Lightfoot used Principal Components Analysis (PCA) – a technique similar in mission to MDS but resulting in a different 2-dimensional plot (I've added an accessible-ish guide to PCA below – apologies to the purists).
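As a concrete sketch of that encoding (the string labels here are hypothetical, not The Public Whip's actual format):

# Map a raw vote record to the numeric coding described above.
CODE = {"aye": 1, "no": -1}

def encode(vote):
    return CODE.get(vote, 0)   # anything else (absent, abstain, ...) -> 0

print([encode(v) for v in ["aye", "no", "absent"]])  # [1, -1, 0]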

I find this kind of exploratory analysis interesting, so I got a project student to build some software that would allow the user to perform this analysis under different conditions (different parliaments, different sets of MPs, etc). The result was a really nice bit of software that will hopefully be made freely available somewhere at some point. My interest didn't dwindle though, and I had a bit of free time the other day, so I exported some data from the application for the 2005 parliament to play with. I used Matlab to do a standard PCA on the 2005 parliament. Here is a plot of the MPs in their new world, coloured according to party (some of the colours aren't very clear… sorry!):

The two dimensions chosen by PCA nicely split the three main parties into separate clouds. This isn’t surprising: MPs will often vote with their party and so we’d expect MPs within a party to vote similarly and MPs from different parties to vote differently.

However, there seems to be a strange centralising force! Something is pulling politicians from all three parties into the centre of the plot. This is particularly obvious for the Conservatives (blue), whose MPs seem to form a line pointing towards the point (0,0).

It would be nice to think that this was the result of PCA capturing some interesting diversity within the main parties. However, as Chris Lightfoot discusses in his original analysis, this isn’t always the case: it’s impossible to disambiguate such real variation from artificial variation due to our representation of the data.

Recall that we gave each MP one of three values for each vote: -1,1 or 0.  It is this last category that is the problem.  An MP voting neither for nor against a particular motion could be doing so for a number of reasons, the most likely of which is that he or she wasn’t present on the day of the vote (play with this data for a while, and you’ll be struck by how rarely some MPs do vote).  Encoding this as ‘0’ has the effect of pulling MPs who don’t vote very often towards the centre of the plot.

Ideally, rather than being forced to pick a number, it would be better if we could treat these values as missing. However, PCA forces us to supply a number for each vote for all MPs. Using '0' (or indeed anything) is a problem. Think of it this way: by using '0', we are saying that the MP is sitting on the fence – halfway between for (1) and against (-1). In some cases this may be reasonable, but for the most part, it's likely to be a gross over-simplification.

In statistical terms, these values are ‘missing’ and the problem of missing values is widely studied in statistics because they occur all over the place.  Sometimes they occur randomly – imagine that the complete data-set exists somewhere and each value is kept or removed based on the toss of a coin. Other times there will be something systematic that causes values to be missing.  Either way, there is a lot of literature devoted to this.

Sadly, there is no wonderful solution to the missing value problem using standard PCA.  People use ad-hoc techniques like inserting average values or a constant value (that’s what I did with the ‘0’s).  Chris Lightfoot tried a couple of things that were, by his own admission, hacks.   They produced some interesting patterns but the problem with hacking around like this is that it’s impossible to be sure whether or not what we’re seeing means anything.

Fortunately, help is at hand. Probabilistic models are popular within the data analysis world, and one reason is their ability to handle missing values. In fact, they not only overcome them, but can also simultaneously make an educated guess at what the values should have been – how the MP would have voted, had they been present. One such model is Probabilistic PCA, published by researchers from Microsoft Research in Cambridge (download it here – might be paywalled). There is also what looks like an older technical report from Sam Roweis (available here). The details are all in those papers – mathematical literacy required! Suffice to say that we can leave lots of MP-vote combinations blank and let the algorithm do its thing.
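To give a flavour of what's inside those papers, here is the core EM iteration for PCA in the zero-noise limit (after Roweis), as a minimal Python sketch of my own; it covers the complete-data case only, and handling missing entries needs the full probabilistic treatment:

import numpy as np

def em_pca(Y, d=2, iters=100, seed=0):
    # Y: (n_features, n_points), centred, with no missing entries.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(Y.shape[0], d))       # initial guess at the components
    for _ in range(iters):
        Z = np.linalg.solve(W.T @ W, W.T @ Y)  # E-step: infer latent coordinates
        W = Y @ Z.T @ np.linalg.inv(Z @ Z.T)   # M-step: re-fit the mapping
    return W, Z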

A bit of googling uncovered some code courtesy of Jakob Verbeek that does PPCA (sensibly, a la Roweis) in Matlab.  Here are the results:

The difference is striking – each party is now represented by a much tighter cluster. A tentative conclusion might be that most of the diversity observed in the previous plot was due to attendance, rather than voting patterns. This is a bit of a shame: I’d like my MPs to be a bit more independent, but that’s just my opinion.

There is some diversity in the plot, and we can probably be confident that this is real.  Look at the little tail of Labour MPs (red) going towards the Lib Dems. The fact that they’re joined by SDLP members is reassuring. I need to spend some time identifying some of these people to be certain that it makes sense but my guess is that it will. Other points of interest: in the first analysis, the DUP (black dots, hard to distinguish from blue ones) were intermingled with the Conservatives. It looks like this is just due to attendance: in the second plot, they form a nice little coherent group all by themselves – a kind of mini, more liberal Conservative party.

The only group that are now spread around are the independents (green dots) and there is no reason why they would vote together because, um, they’re independent.

At first glance, it looks like the PPCA is doing a great job – the grouping is incredibly clean. When I get some more time, I'd like to look into this further – PCA (and PPCA) find more than 2 dimensions, and looking at the third, fourth, fifth etc. might provide some interesting patterns. It would be nice to also look at other parliaments (data from 1997, 2001 and 2010 is available), and to look at the data for just one party. Inter-party differences will always be greater than intra-party differences (that's why they are different parties!) and so swamp the analysis.

If nothing else, it’s a nice dataset for the data analysis community to play with to test their algorithms for projection and missing value imputation.

Appendix: Hand-wavy introduction to PCA

Imagine that there are three votes, rather than 1500. We want to combine these three numbers (remember that the votes are now numbers: 0, -1 or 1) somehow to give us 2 numbers for each MP that we can use for plotting – an x-coordinate and a y-coordinate if you like.

One way in which we could combine them is to add them up. If a particular MP voted -1, -1, 0, their new number would be -2. More generally, we can assign a weight to each vote, multiply each vote by its weight and add up the results. Say our weights were 1, -2, 5; our number would become (1 x -1) + (-2 x -1) + (5 x 0) = 1.

We could pick as many sets of weights as we like to give us as many new numbers as we like!  This is exactly what PCA does – it picks 2 sets of weights such that for each MP, we can create 2 numbers from her voting pattern (actually PCA can get more than 2, but let’s not worry about that).

I picked those weights randomly. PCA picks them systematically, such that when we plot the MPs in their new 2-dimensional world, they are spread out as much as possible (formally, the variance of the new numbers is maximised). This is good, because it tends to reveal any group structure in the data. If MPs vote in groups (parties), these groups will be apparent when we do the PCA.
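For the curious, here is roughly what that weight-picking looks like in practice – a minimal Python sketch of my own, using the standard SVD route:

import numpy as np

def pca_2d(votes):
    # votes: (n_mps, n_votes) matrix of -1/0/1 codes.
    X = votes - votes.mean(axis=0)     # centre each vote (column)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    weights = Vt[:2]                   # the two sets of weights PCA picks
    return X @ weights.T               # an (x, y) coordinate for each MP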