In a previous post, I compared the results of analysing data from various parliaments using classical PCA and Probabilistic PCA (PPCA). The point of this was to show the effect that naively handling missing values (assuming they had a value of 0) had on the analysis: it looked like MPs from the same parties were more different from one another than perhaps they really are, because all those zeros were having too much of a say on the outcome.
Since then, I’ve been playing with this a little more, going in two connected directions. Firstly, the data is binary and PCA/PPCA are both built to handle real values. It would be better to use a PCA-like approach that is specifically designed for binary data. Some exist in the literature; I’ve also made a new(ish) one myself.
Secondly, I’ve become interested in how best to handle missing values in the PPCA models. The PPCA approach I used previously simultaneously found the components and filled in the missing data as it (the model) would have expected to see it. On the face of it, this seems reasonable. An alternative is to explicitly build into the model the fact that not every MP-vote pair exists. In this case, the model knows not to expect each MP to vote each time and only takes into account the pairs that do exist.
The difference between the two may seem subtle, but it makes quite a significant difference to the outcome. In particular, both methods allow us not just to see whereabouts an MP is in the reduced space (where the dot is in the two-dimensional plots), but also to get some idea of the variance of this position: how certain the model is of where the MP should be. I would hope that the more often an MP voted, the more certain we’d be about her position. The method that only takes into account the data that exists behaves in this way; the method that guesses the missing votes doesn’t.
Why? Well, when the method comes to working out the position of the MP, its certainty will depend on (amongst other things) how much data there is. In the case of the method that guesses, it sees its own predictions as ‘data’ and becomes much more confident (overly so). Not only does it see more data (or what it thinks is data), the additional stuff it sees conforms exactly to what it expects to see (because it produced it itself). This results in an endless spiral of increasing confidence (and hence decreasing uncertainty).
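To make the effect concrete, here is a minimal sketch in Python/NumPy. It is not the code behind the earlier post; it assumes a standard PPCA-style model in which each vote is a linear function of the MP's latent position plus isotropic Gaussian noise, and it assumes the loadings have already been fitted. The names `latent_posterior_cov`, `W`, `sigma2` and `mask_first` are made up for illustration. Under those assumptions, the posterior covariance of an MP's position depends only on the rows of the loading matrix for the votes that are counted as data.

```python
import numpy as np


def latent_posterior_cov(W, sigma2, observed):
    """Posterior covariance of one MP's latent position (PPCA-style model).

    W        : (n_votes, n_components) loading matrix, assumed already fitted
    sigma2   : isotropic noise variance (assumed known here)
    observed : boolean array, True for the votes that are counted as data
    """
    W_obs = W[observed]                      # keep only the votes counted as data
    k = W.shape[1]
    precision = np.eye(k) + W_obs.T @ W_obs / sigma2
    return np.linalg.inv(precision)


rng = np.random.default_rng(0)
n_votes, k, sigma2 = 200, 2, 1.0
W = rng.normal(size=(n_votes, k))            # illustrative loadings, not real data


def mask_first(n):
    """Hypothetical helper: an MP who took part in the first n votes only."""
    mask = np.zeros(n_votes, dtype=bool)
    mask[:n] = True
    return mask


# Model that only uses the MP-vote pairs that exist.
rare_voter = latent_posterior_cov(W, sigma2, mask_first(10))
frequent_voter = latent_posterior_cov(W, sigma2, mask_first(150))

# Model that treats its own guesses as data: every vote counts for every MP.
imputed_as_data = latent_posterior_cov(W, sigma2, np.ones(n_votes, dtype=bool))

# Total variance of the position in each case.
print(np.trace(rare_voter), np.trace(frequent_voter), np.trace(imputed_as_data))
```

With the mask, the MP who rarely votes gets a visibly larger variance than the one who votes often. Once the guesses are counted as data, every MP ends up with the same, smallest variance, however few votes they actually cast.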
Of course, this is just one model, and I’m sure that other models exist that impute missing values as they optimise without creating such a bias. It’s also important to remember that the problem comes about here because the guessing of the missing values and the determining of the MP positions are done simultaneously. We can always impute the missing values at the end, once the model is happy with its predictions of the MPs’ positions.
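Imputing at the end could look something like the sketch below, again assuming a PPCA-style model that has already been fitted on the observed pairs only. The function `impute_after_fit` and its arguments are hypothetical names, and missing votes are marked with NaN purely for convenience. The position is estimated from the observed votes alone, and the gaps are filled from that position afterwards, so the filled-in values never feed back into the estimate.

```python
import numpy as np


def impute_after_fit(x, observed, W, mu, sigma2):
    """Fill in one MP's missing votes after the model has been fitted.

    x        : (n_votes,) vote vector, NaN where the MP did not vote
    observed : boolean array, True where the MP actually voted
    W, mu    : fitted loadings (n_votes, n_components) and per-vote means
    sigma2   : fitted isotropic noise variance
    """
    W_obs = W[observed]
    k = W.shape[1]
    # Posterior mean of the latent position, using observed votes only.
    M = W_obs.T @ W_obs + sigma2 * np.eye(k)
    z_mean = np.linalg.solve(M, W_obs.T @ (x[observed] - mu[observed]))
    # Only now predict the missing entries; observed entries are left untouched.
    x_filled = x.copy()
    x_filled[~observed] = (W @ z_mean + mu)[~observed]
    return z_mean, x_filled


rng = np.random.default_rng(1)
W = rng.normal(size=(6, 2))                  # illustrative values, not real data
mu = np.zeros(6)
x = W @ np.array([0.5, -1.0]) + 0.1 * rng.normal(size=6)
x[[1, 4]] = np.nan                           # two votes the MP missed
observed = ~np.isnan(x)

z_mean, x_filled = impute_after_fit(x, observed, W, mu, sigma2=0.01)
print(z_mean)
print(x_filled)
```

The uncertainty about the MP's position would still come from the observed votes alone, as it should.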
However, it opens up an interesting question as to whether we should ever be imputing missing values as we go along in models such as this, when we have a model that doesn’t make it necessary. People seem to do it, and I guess they do it because it seems like they’re getting something for nothing. But the amount of information at our disposal is determined by the data that we can see. We can fill in gaps, but doing so doesn’t give us more information. The problem with the PPCA model that I used is that it was filling in the numbers and then treating these values on a par with the actual data.
The practical implication is that the results I showed probably gave the impression that the parties are more coherent than they actually are. I’m going to redo this with the more sensible model soon.
Footnote: Interestingly, a recent JMLR paper tackles the problem of missing values in PCA in (a lot of) detail: ilin10a.pdf