Can you solve this probability question?

A colleague asked me the following probability question:

Suppose I have N empty boxes and in each one I can place a 0 or a 1. I randomly choose n boxes in which to place a 1. If I start from the first box, what is the probability that the first 1 I find will be in the mth box?

This is depicted in the following graphic:


and we’re interested in the probability that the first 1 (reading from the left) is in, say, the third box.

I’m sure there must be a really nice form to the solution, but I can’t come up with one. The best I can do is the following slightly clunky one – any better ideas?

The probability of the first 1 being in the mth box is equal to the number of configurations where the first 1 is in the mth box, divided by the total number of configurations. The total number of configurations is given by:

\left(\begin{array}{c}N\\n\end{array}\right) = \frac{N!}{n!(N-n)!}

and the number of configurations that have their first 1 in the mth box is equal to the subset of configurations that start with (m-1) 0s followed by a 1. This is the same as the number of ways of arranging the remaining (n-1) 1s in the (N-m) boxes after the first 1:

\left(\begin{array}{c}N-m\\n-1\end{array}\right) = \frac{(N-m)!}{(n-1)!(N-m-(n-1))!}

So, the probability that the first 1 is in the mth box is:

P(m) = \frac{\left(\begin{array}{c}N-m\\n-1\end{array}\right)}{\left(\begin{array}{c}N\\n\end{array}\right)}
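(A quick check I’ve added, not part of the original derivation: the formula is easy to verify numerically in a few lines of Python, both against the normalisation constraint and against simulation.)

```python
# Numerical check of the closed form: P(m) = C(N-m, n-1) / C(N, n).
import random
from math import comb

def p_first_one(m, N, n):
    """Probability that the first 1 is in the mth of N boxes containing n 1s."""
    return comb(N - m, n - 1) / comb(N, n)

N, n = 100, 10

# The first 1 must land somewhere in positions 1..N-n+1, so these sum to 1.
total = sum(p_first_one(m, N, n) for m in range(1, N - n + 2))
assert abs(total - 1.0) < 1e-9

# Monte Carlo sanity check: scatter n ones uniformly at random and record
# the position of the first one.
random.seed(0)
trials = 200_000
hits = sum(min(random.sample(range(1, N + 1), n)) == 3 for _ in range(trials))
print(hits / trials, p_first_one(3, N, n))  # these should agree closely
```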

Here’s what it looks like for a few different values of n, when N=100:




which makes sense – the more 1s you have, the more likely you are to get one early. So, can anyone produce a neater solution? I’m sure it’ll just involve transforming the problem slightly.

Footnote: because the first 1 has to occur somewhere between the 1st and (N-n+1)th position,
\sum_{m=1}^{N-n+1} P(m) = 1
and therefore:
\left(\begin{array}{c}N\\n\end{array}\right) = \sum_{m=1}^{N-n+1}\left(\begin{array}{c}N-m\\n-1\end{array}\right),
which seemed surprising to me at first, although it turns out to be an instance of the well-known hockey-stick identity for binomial coefficients.

Footnote2: Here’s another way of computing it, still a bit clumsy. The problem is the same as if we had N balls in a bag, n of which are red and (N-n) of which are black. If we start pulling balls out of the bag (and not replacing them), the probability that the first red one appears on the mth draw is equal to the probability of drawing m-1 black ones and then drawing a red one. The first probability can be computed from the hypergeometric distribution as:
\frac{\left(\begin{array}{c}n\\ 0 \end{array}\right) \left(\begin{array}{c}N-n\\ m-1-0 \end{array}\right)}{\left(\begin{array}{c}N\\ m-1 \end{array}\right)}
and the second probability is just the probability of picking one of the n reds from the remaining N-(m-1) balls, i.e. n/(N-(m-1)). So the full probability is:
P(m) = \frac{n}{N-(m-1)} \times \frac{\left(\begin{array}{c}n\\ 0 \end{array}\right) \left(\begin{array}{c}N-n\\ m-1-0 \end{array}\right)}{\left(\begin{array}{c}N\\ m-1 \end{array}\right)}
which is arguably messier than the previous one.
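As a sanity check (again mine, not part of the original post), the two derivations can be compared numerically – they agree for every valid m:

```python
# The combinatorial derivation and the drawing-without-replacement derivation
# give the same P(m); this checks them against each other numerically.
from math import comb

def p_combinatorial(m, N, n):
    return comb(N - m, n - 1) / comb(N, n)

def p_sequential(m, N, n):
    # Probability of m-1 blacks (the hypergeometric term) times the
    # probability of then drawing a red from the remaining N-(m-1) balls.
    hyper = comb(n, 0) * comb(N - n, m - 1) / comb(N, m - 1)
    return hyper * n / (N - (m - 1))

for m in range(1, 92):  # N = 100, n = 10, so m runs from 1 to N-n+1 = 91
    assert abs(p_combinatorial(m, 100, 10) - p_sequential(m, 100, 10)) < 1e-12
print("both derivations agree")
```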


Bad Science. Literally.

The following article recently appeared on the Website of ‘Science’, one of the premier scientific journals:
Who’s Afraid of Peer Review?
The issue was also picked up by the Guardian newspaper:
Hundreds of open access journals accept fake science paper

The articles describe how a hoax paper submitted to several hundred Open Access journals (journals whose articles are available free of charge to anyone) was accepted by a large number of them, sometimes without any peer review. This is obviously bad.

Although the article raises some important issues, it seems wholly unfair to label them as Open Access issues. I’m confident that there are plenty of shifty journals out there (an unfortunate side-effect of the (also unfortunate) increasing pressure to publish more and more), some of which are OA and some of which are not. To make claims about OA specifically, the Science hoax would also have needed to submit the paper to a large number of non-OA journals and compare the acceptance rates. In fact, it’s very surprising that Science published a study that was so, well, unscientific. Perhaps it’s a double hoax?

Science, by the way, is not OA. If you don’t work in a University that has a subscription, it’ll cost you $150 per year to read the research that is mainly funded through public funds. The current OA system isn’t perfect but it feels like things are moving in the right direction.

If you want more, the hoax has already been picked apart by e.g.: Open access publishing hoax: what Science magazine got wrong, and I’m sure there will be others.

Visualising touch errors

Last year (I’ve been meaning to write this for a while), Daniel Buschek spent a few months in the IDI group as an intern. His work here resulted in an interesting study looking at the variability of touch performance on mobile phones between users and between devices. Basically, we were interested in seeing whether an algorithm trained to improve a user’s touch performance on one phone could be transferred to another phone. To find the answer, you’ll have to read the paper…

During the data collection phase of the project, Daniel produced some little smartphone apps for visualising touch performance. For example, on a touch-screen smartphone navigate to:

Hold your phone in one hand and repeatedly touch the cross-hair targets with the thumb of the hand you’re holding the phone with. Once you’ve touched 50 targets the screen will change. If you’re going to do it, do it now before you read on!

Ok, so now you’ll see something that looks a bit like this:

This is a visualisation of touch accuracy (or lack thereof) – the lines moving to the right (from green to red) demonstrate that I typically touch some distance to the right of where I’m aiming (this is common when holding the phone in the right hand and using the right thumb).

Here’s how it works: at the start, the app chooses 5 targets (the green points). These are presented to you in a random order. When you touch, the app records where you touched and replaces the original target location with the position where you actually touched. i.e. there are still five targets, but one of them has moved. Once you’ve seen (and moved) all five targets, it loops through them again showing you the new targets and again replacing with the new touch. Because most people have a fairly consistent error between where they are aiming and where they actually touch, plotting all of the targets results in a gradual drift in one direction, finishing (after 50 touches, 10 for each trace) with the red points. Note that not all people have this consistency (something discussed in Daniel’s paper).
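The replacement rule is simple enough to sketch in a few lines of Python. This is my reconstruction of the behaviour described above, not Daniel’s actual app code, and the offset value is made up:

```python
# Sketch of the target-replacement rule: each touch lands at the current
# target plus a (roughly constant) offset, and the touch position becomes
# the next target. With a constant offset and no noise, each of the 5
# traces drifts by exactly one offset per touch.
NUM_TARGETS, TOUCHES_PER_TARGET = 5, 10
OFFSET = (12.0, -3.0)  # illustrative rightward thumb offset, in pixels

targets = [(100.0 * i, 200.0) for i in range(NUM_TARGETS)]  # initial (green) points
traces = [[t] for t in targets]  # one trace per target

for _ in range(TOUCHES_PER_TARGET):
    for i, (tx, ty) in enumerate(targets):
        touch = (tx + OFFSET[0], ty + OFFSET[1])  # where the user actually touches
        targets[i] = touch                        # the target moves to the touch
        traces[i].append(touch)

# After 50 touches (10 per trace), each trace has drifted 10 offsets.
start_x, _ = traces[0][0]
end_x, _ = traces[0][-1]
print(end_x - start_x)  # 120.0, i.e. 10 * 12.0
```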

To demonstrate how the errors vary, here’s another one I did, but this time with the left hand, where the offset is now in a completely different direction:

One particularly odd feature (for me anyway) is that I know that I’m touching with errors, but don’t seem to be able to correct it! The fact that the errors are so different for left-hand and right-hand use points to a problem in designing methods to correct for errors – we need to know which hand the user is using.

If you’d like to try one target rather than 5, use this link:

If you’re interested in why there is an offset, then this paper by Christian Holz and Patrick Baudisch is a good starting point.


In the Machine Learning course I teach (designed with Mark Girolami), I use Olympic 100m winning times to introduce the idea of making predictions from data (via linear regression). Here’s the data, with the linear regression line projected into the future. Black is men, red women (dashed lines represent the plus and minus three standard deviation region):
There are two reasons for using this data. The first is that it’s nice and visual and something students can relate to. The second is that it opens up a discussion about some of the bad things you can do when building predictive models – in particular, what happens as we move further into the future. Here’s the same plot up to 2200:
Because women have historically been getting faster more quickly than men, the two lines cross at the 2156 Olympics. At about this point in the lectures a vocal student will pipe up and point out that assuming the trend continues this far into the future looks a bit iffy. And if we assume a linear trend indefinitely, we eventually get to the day when the 100m is won in a negative number of seconds:
Hopefully you’ll agree that we should be fairly sceptical of any analysis that says this could happen. So I was quite surprised to discover when reading David Epstein’s The Sports Gene that academics had done exactly this. And published it in Nature. NATURE! Arguably the top scientific journal IN THE WORLD. If you don’t believe me, here it is:
Momentous sprint at the 2156 Olympics.
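For what it’s worth, the whole extrapolation exercise fits in a few lines of Python. The years and times below are illustrative placeholders rather than the real Olympic results:

```python
# Fit a straight line to each set of winning times and extrapolate:
# find the crossing year, and the (absurd) year each line hits zero seconds.
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

years = [1928, 1948, 1968, 1988, 2008]
men_times = [10.8, 10.3, 9.95, 9.92, 9.69]       # illustrative, seconds
women_times = [12.2, 11.9, 11.08, 10.54, 10.75]  # illustrative, seconds

m1, c1 = fit_line(years, men_times)
m2, c2 = fit_line(years, women_times)

# The regression lines cross where m1*x + c1 == m2*x + c2 ...
crossing = (c2 - c1) / (m1 - m2)
# ... and each eventually predicts a winning time of zero seconds.
zero_men, zero_women = -c1 / m1, -c2 / m2

print(round(crossing), round(zero_men), round(zero_women))
```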

Visualising parliamentary data…again

In a previous post, I described a very general binary PCA algorithm for visualising voting data from the House of Commons.

To recap, we start with voting data which tells us how each of the 600 or so MPs voted in each of the 1400 or so votes of each parliament. An MP can vote for a particular motion or against it, or may not vote at all. The algorithm converts this into a two-dimensional plot where each MP is represented by a point. MPs close together have similar voting patterns, and those far apart don’t. See this for a better description.
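To give a flavour of the idea (this is plain PCA on a made-up vote matrix, not the binary PCA model actually used for the plots), here is a toy version in Python:

```python
# Toy version: rows are MPs, columns are votes (+1 for, -1 against), and we
# project the MPs onto the top two principal components. Synthetic data:
# two "parties" that mostly vote as opposing blocs, plus occasional rebellion.
import numpy as np

rng = np.random.default_rng(0)
n_votes = 40

party_a_line = rng.choice([-1.0, 1.0], size=n_votes)
party_b_line = -party_a_line

def mp(line, flips=4):
    v = line.copy()
    idx = rng.choice(n_votes, size=flips, replace=False)
    v[idx] *= -1  # this MP rebels on a few votes
    return v

X = np.array([mp(party_a_line) for _ in range(10)] +
             [mp(party_b_line) for _ in range(10)])

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # each MP becomes a 2D point

# MPs from the two parties land on opposite sides of the first component.
print(coords[:10, 0].mean(), coords[10:, 0].mean())
```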

I’ve now extended the algorithm a bit – no time to go into the details now – but the outcome is that it produces a visualisation that incorporates a degree of uncertainty in the location of the MPs. This uncertainty can come from one of two sources:
1. Lack of data: if MPs don’t vote very much, we can’t be sure about where they should be plotted (a nice example are the three deputy speakers in 2010plain – big ellipses near the centre).
2. Lack of conformity: it might be hard to place some MPs in two dimensions in such a way that most of their votes are modelled correctly.

The following plots show the results for the 1997, 2001, 2005 and 2010 (up to about May 2010) parliaments. Each ellipse is an MP, and they’ve been coloured according to their party (main parties are obvious, key at the bottom for the others). The ellipses roughly represent where the MP might lie — the bigger the ellipse, the less sure we are about the location.

The lines on the plots represent the votes. MPs on one side of the line voted for the motion and on the other against. It would be nice to label some of the votes, maybe I’ll do this soon.

Anyway, here are the files. They are a bit messy, and I’ve labeled some of the MPs who I thought might be interesting. Happy to label any others if anyone is interested.


Some things that I think are interesting:
1. Clare Short before and after (Clare Short2) resigning (2005plain).
2. Ken Livingstone before and after (Ken Livingstone 2) resigning (1997plain)
3. How much the nationalist parties who were previously close to the Conservatives and Lib Dems (e.g. DUP and SNP) have now deserted them (Compare everything with 2010plain)
4. How close Clegg and Cameron are (2010plain) (couldn’t find a font small enough to separate them)
5. There appear to be more rebellious Conservatives in the coalition (2010plain) than Lib Dems (more lines (votes) cut through the blue ellipses than the yellow ones).
6. In 1997, the Lib Dems seemed to vote almost equally with Labour and with the Conservatives (roughly the same number of lines splitting Lib&Lab from Con as splitting Lib&Con from Lab). In 2001 and 2005, the Lib Dems seem more aligned with the Conservatives than Labour (more lines splitting Lab from Lib&Con than splitting Lib&Lab from Con).

Red: Labour
Blue: Conservative
Yellow: Lib Dem
Magenta: DUP
Orange: Plaid Cymru
Weird pinky-orange colour: Scottish National Party (normally found next to Plaid Cymru)
Green: All the rest

Bayesian model selection for beginners

I’ve recently encountered the following problem (or at least, a problem that can be formulated like this, the actual thing didn’t involve coins):

Imagine you’ve been provided with the outcomes of two sets of coin tosses.

First set: N_1 tosses, n_1 heads

Second set: N_2 tosses, n_2 heads

And we want to know whether the same coin was used in both cases.  Note that these ‘coins’ might be very biased (always heads, always tails); we don’t know.

There are classical tests for doing this, but I wanted to do a Bayesian test.  This in itself is not much of a challenge (it’s been done many times before) but what I think is interesting is how to explain the procedure to someone who has no familiarity with the Bayesian approach.  Maybe if that is possible, more people would do it.  And that would be a good thing.  Here’s what I’ve come up with:

We’ll first assume a slight simplification of the problem.  We’ll assume that there are 9 coins in a bag and the person who did the coin tossing either:

1. Pulled one coin out of the bag and did N_1 tosses followed by N_2 tosses, or
2. Pulled one coin out of the bag, did N_1 tosses, put the coin back, pulled out another coin (which could have been the same one) and did N_2 tosses with this one.

The coins in the bag all have different probabilities of landing heads.  The probabilities for the 9 coins are: 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9.  We’ll label these coins c_1 to c_9.

Model 1

We’ll start by building a model for each scenario.  The first model assumes that both sets of coin toss data were produced by the same coin.  If we knew which coin it was, say c_n, the probability of seeing the data we’ve seen would be:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

i.e. the probability of seeing n_1 heads in the first N_1 tosses of coin c_n multiplied by the probability of seeing n_2 heads in the second N_2 tosses of coin c_n.

In reality we don’t know which coin it was. In this case, it seems sensible to calculate the average probability: the total of all of the probabilities divided by the number of coins. This is computed as:

p_1=\sum_{n=1}^9 \frac{1}{9} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)

Another way of thinking about this is as a weighted sum of the probabilities where each weight is the probability of that coin being chosen. In this case, each coin was equally likely and so has probability 1/9. In general this doesn’t have to be the case.

Note that we haven’t defined what Pr(n_1|N_1,c_n) is. This isn’t particularly important – it’s just a number that tells us how likely we are to get a particular number of heads with a particular coin.

Model 2
The second model corresponds to picking a coin randomly for each set of coin tosses. The probability of the data in this model, if we knew that the coins were c_n and c_m, is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)
Again, we don’t know c_n or c_m so we average over them both. In other words, we average over every combination of coin pairs. There are 81 such pairs and each pair is equally likely:

p_2 = \sum_{n=1}^9 \sum_{m=1}^9 \frac{1}{81} Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

Comparing the models

We can now compare the numbers p_1 and p_2. If one is much higher than the other, that model is the more likely. Let’s try an example…

Example 1
Here’s the data: N_1 = 10, n_1 = 9 and N_2 = 10, n_2 = 1.

So, 9 heads in the first set of tosses and only 1 in the second. At first glance, it looks like a different coin was used in each case. Let’s see if our calculations agree. Firstly, here is Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n) plotted on a graph where each bar is a different c_n.

The coin that looks most likely is the one with c_n=0.5. But, look at the scale of the y-axis. All of the values are very low – none of them look like they could have created this data. It’s hard to imagine taking any coin, throwing 9 heads in the first ten and then 1 head in the second ten.

To compute p_1 we add up the total height of all of the bars and divide the whole thing by 9: p_1=0.000029.

We can make a similar plot for model 2:

The plot is now 3D, because we need to look at each combination of c_n and c_m. The height of each bar is:

Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)

The bars are only high when c_n is high and c_m is low. But look at the height of the high bars – much, much higher than those in the previous plot. Some combinations of two coins could conceivably have generated this data. To compute p_2 we need to add up the heights of all the bars and divide the whole thing by 81: p_2 = 0.008478

This is a small number, but almost 300 times bigger than p_1. The model selection procedure has backed up our initial hunch.
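All of the numbers in these examples are easy to reproduce. The sketch below takes Pr(n|N,c) to be binomial, which is the natural choice for coin tosses (I left it unspecified above):

```python
# Reproducing p_1 and p_2 for the coin examples. Pr(n|N,c) is binomial.
from math import comb

coins = [i / 10 for i in range(1, 10)]  # c_1 .. c_9

def pr(n, N, c):
    return comb(N, n) * c**n * (1 - c) ** (N - n)

def p1(n1, N1, n2, N2):
    # Model 1: the same coin for both sets, averaged over the 9 coins.
    return sum(pr(n1, N1, c) * pr(n2, N2, c) for c in coins) / 9

def p2(n1, N1, n2, N2):
    # Model 2: a coin drawn independently for each set, averaged over 81 pairs.
    return sum(pr(n1, N1, cn) * pr(n2, N2, cm)
               for cn in coins for cm in coins) / 81

print(p1(9, 10, 1, 10), p2(9, 10, 1, 10))  # ≈ 0.000029 and 0.008478 (example 1)
print(p1(6, 10, 5, 10), p2(6, 10, 5, 10))  # ≈ 0.01667 and 0.01020 (example 2)
```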

Example 2


Here’s the data: N_1 = 10, n_1 = 6 and N_2 = 10, n_2 = 5, i.e. 6 heads in the first 10 and 5 heads in the second 10. This time it looks a bit like it might be the same coin. Here’s the plot for model 1:

And here’s the magic number: p_1 = 0.01667

Model 2:

With the number: p_2 = 0.01020.

This time, the model selection procedure slightly favours model 1.

Why so slightly? Well, the data could have come from two coins – maybe the two coins are similar. Maybe it was the same coin selected twice. For this reason, it is always going to be easier to say that they are different than that they are the same.

Why would it ever favour model 1?

Given that we could always explain the data with two coins, why will the selection procedure ever favour one coin? The answer to this is one of the great benefits of the Bayesian paradigm. Within the Bayesian framework, there is an inbuilt notion of economy: a penalty associated with unnecessarily complex models.

We can see this very clearly here – look at the two plots for the second example – the heights of the highest bars are roughly the same: the best single coin performs about as well as the best pair of coins.  But the average single coin is better than the average pair of coins.  This is because the total number of possibilities (9 -> 81) has increased faster than the number of good possibilities. The average goodness has decreased.

Model 2 is more complex than model 1 – it has more things that we can change (2 coins rather than 1). In the Bayesian world, this increase in complexity has to be justified by much better explanatory power. In example 2, this is not the case.

In example 1 the complexity penalty was still there, but it was overwhelmed: model 1 had no good possibilities (no single coin could realistically have generated the data) whereas model 2 had some. Hence, model 2 was overwhelmingly more likely than model 1.


This is a simple example. It also doesn’t explain all of the important features of Bayesian model selection. However, I think it nicely shows the penalty that Bayesian model selection naturally imposes on overly complex models. There are other benefits to this approach that I haven’t discussed. For example, this doesn’t show the way in which p_1 and p_2 change as more data becomes available.

Relating back to Bayes rule

The final piece of the jigsaw is relating what we’ve done above to the equations that might accompany other examples of Bayesian model selection.  Bayes rule tells us that for model 1:

Pr(c_n|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_n)Pr(c_n)}{\sum_{o=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_o)Pr(c_o)}

and for model 2:

Pr(c_n,c_m|n_1,N_1,n_2,N_2) = \frac{Pr(n_1|N_1,c_n)Pr(n_2|N_2,c_m)Pr(c_n)Pr(c_m)}{\sum_{o=1}^9 \sum_{q=1}^9 Pr(n_1|N_1,c_o)Pr(n_2|N_2,c_q)Pr(c_o)Pr(c_q)}

In both cases, we computed the denominator of the fraction – known as the evidence. In general, it may consist of an integral rather than a sum but the idea is the same: averaging over all parameters (coins or combinations of coins) to get a single measure of goodness for the model.

PCA, missing values and variance…

In a previous post, I compared the results of analysing data from various parliaments using classical PCA and Probabilistic PCA.  The point of this was to show the effect that naively handling missing values – assuming they had a value of 0 – had on the PCA analysis: It looked like MPs from the same parties were more different from one another than perhaps they really are – all those zeros were having too much of a say on the outcome.

Since then, I’ve been playing with this a little more, going in two connected directions.  Firstly, the data is binary and PCA/PPCA are both built to handle real values.  It would be better to use a PCA-like approach that is specifically designed for binary data.  Some exist in the literature; I’ve also made a new(ish) one myself.

Secondly, I’ve become interested in how best to handle missing values in the PPCA models. The PPCA approach I used previously simultaneously found the components and filled in the missing data as it (the model) would have expected to see it.  On the face of it, this seems reasonable.  An alternative is to explicitly build into the model the fact that not every MP-vote pair exists.  In this case, the model knows not to expect each MP to vote each time and only takes into account the pairs that do exist.

The difference between the two may seem subtle, but it makes quite a significant difference to the outcome.  In particular, both methods allow us to not just see whereabouts an MP is in the reduced space (where the dot is in the two-dimensional plots), but to get some idea of the variance of this position: how certain the model is of where the MP should be.  I would hope that the more often an MP voted, the more certain we’d be about her position.  The method that only takes into account the data that exists behaves in this way, the method that guesses the missing votes doesn’t.

Why?  Well, when the method comes to working out the position of the MP, its certainty will depend on (amongst other things) how much data there is.  In the case of the method that guesses, it sees its own predictions as ‘data’ and becomes much more confident (overly so).  Not only does it see more data (or what it thinks is data), the additional stuff it sees conforms exactly to what it expects to see (because it produced it itself).  This results in an endless spiral of increasing confidence (and hence decreasing uncertainty).
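A toy example (mine, and much simpler than the PPCA model) shows the effect: estimate a mean from a few observed values, then impute the missing entries at the model’s own prediction and treat them as data – the apparent standard error collapses:

```python
# The honest standard error of the mean uses only the observed count; the
# impute-and-refit version looks far more confident than the data justifies,
# because the imputed 'data' agrees perfectly with the model that produced it.
import statistics

observed = [1.2, 0.8, 1.1, 0.9, 1.0]  # 5 real measurements
n_missing = 95                         # 95 entries we never saw

mean = statistics.fmean(observed)
sd = statistics.stdev(observed)

# Honest standard error: based on the 5 points we actually have.
honest_se = sd / len(observed) ** 0.5

# Impute the missing entries at the model's own prediction and pretend they
# are data. The mean doesn't move, but the apparent standard error shrinks
# dramatically (the imputed values also shrink the sample variance itself).
filled = observed + [mean] * n_missing
filled_se = statistics.stdev(filled) / len(filled) ** 0.5

print(honest_se, filled_se)  # the second is far smaller
```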

Of course, this is just one model, and I’m sure that other models exist that impute missing values as they optimise without creating such a bias. It’s also important to remember that this problem arises here because the guessing of the missing values and the determining of the MP positions are done simultaneously. We can always impute them at the end, once the model is happy with its predictions of the MPs’ positions.

However, it opens up an interesting question as to whether we should ever be imputing missing values as we go along in models such as this when we have a model that doesn’t make it necessary.  People seem to do it, and I guess they do it because it seems like they’re getting something for nothing.  But, the amount of information at our disposal is determined by the data that we can see.  We can fill in gaps, but doing so doesn’t give us more information.  The problem with the PPCA model that I used is that it was filling in the numbers and then assuming that these values were to be treated on a par with the actual data.

The practical implication is that the results I showed probably gave the impression that the parties are more coherent than they actually are.  I’m going to redo this with the more sensible model soon.

Footnote: Interestingly, a recent JMLR paper tackles the problem of missing values in PCA in (a lot of) detail: Practical Approaches to Principal Component Analysis in the Presence of Missing Values (ilin10a.pdf)