Hypothesis Testing in a Jury

I recently served on a jury and was quite surprised as how unobjective some of the other jurors were being when thinking about the case. For our case, it turned out not to matter because the decision was obvious, but it got me thinking about a formal reasoning behind "beyond a reasonable doubt." This reasoning will involve more statistics than physics, but considering I've been thinking about Bayesian analyses recently in my research, it's quite appropriate.

At the most basic level, a jury decision is a hypothesis test. I wish to distinguish between the hypotheses of not guilty (call it \mathcal{H}_0, since the defendant is innocent until proven guilty) and guilty (call it \mathcal{H}_1). In Bayesian statistics, the way to compare two hypotheses is by computing the ratio of posterior probabilities.


Where p(\mathcal{H}|D) is the probability of the assumption \mathcal{H} given the available data (the evidence). This probably does not seem obvious to compute, and I'll discuss later how one might determine values for these. If F=2, then the hypothesis \mathcal{H}_1 is twice as likely as the hypothesis \mathcal{H}_0. Thus, if F \gg 1, then the evidence for \mathcal{H}_1 is overwhelming. What this means in terms of "beyond a reasonable doubt" is debatable, but it is generally accepted that if F \gtrsim 100, there is strong evidence for \mathcal{H}_1 over \mathcal{H}_0 [1]. Similarly, if F\ll 1, then the evidence for \mathcal{H}_0 is overwhelming. Thus, the question or determining guilt or not guilt is equivalent to calculating F.

p(\mathcal{H}|D) can be rewritten using Bayes theorem as


Here, p(D|\mathcal{H}) is the probability of the evidence given the assumption of guiltiness (or not-guiltiness), which is more tractable than p(\mathcal{H}|D) itself. Note the prosecutor's fallacy can be thought of as confusing p(D|\mathcal{H}) with p(\mathcal{H}|D). p(\mathcal{H}) is the prior, which takes into account how much one believes \mathcal{H} with regard to other hypotheses. p(D) is a normalization factor to ensure probabilities are always less than 1. In the relation for F, this cancels out, so there is no need to worry about this term. With this replacement, the ratio of posterior probabilities becomes

F = \frac{p(D|\mathcal{H}_1)}{p(D|\mathcal{H}_0)}\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}.

The first ratio is called the Bayes factor. The second ratio quantifies the ratio of prior beliefs of the hypotheses. The ratio is, given no other information, the odds that the defendant is guilty. Suppose I accept that there was a crime committed, but that the identity of the criminal is in question. If there is only one person that committed the crime, this would then be the inverse of the number of people who could have committed the crime.

Now, I will consider the calculation of the Bayes factor for a real trial. Consider R v. Adams, which set a precedent for banning explicit Bayesian reasoning in British Courts in the context of DNA evidence. It was estimated during the trial that there were roughly 200,000 men in the age range 20-60 who could have committed a crime. Note that some extra assumptions on age and gender seem to be made here, so this does not seem applicable to the ratio of prior beliefs. However, if I lifted these restrictions, the Bayes factor for the victim's misidentification of the defendant would change accordingly, so this is not a concern.

First, consider the DNA evidence that was the only piece of evidence incriminating the defendant. Call this evidence D_0. p(D_0|\mathcal{H}_1) is the probability of a positive DNA match under the assumption that the defendant is guilty. This is presumably extremely close to 1, or DNA evidence would not be considered good evidence in trials. p(D_0|\mathcal{H}_0) is the probability of a positive match if the defendant is not guilty. Taking into account the population of the U.K., this was estimated in the trial to be between 1 in 2 million and 1 in 200 million (though possibly as low as 1 in 200 since the defendant had a half-brother) [2]. Thus, the Bayes factor considering only the DNA evidence, is between 2 million and 200 million. With the 1 in 200,000 prior probability, the posterior probability ratio F is between 10 and 1000. Only the higher end of this range is overwhelming evidence, and in the case of conflicting evidence, the jury is supposed to give the benefit of the doubt to the defendant, so it seems a "not guilty" verdict would have been appropriate.

Further, this ignores all of the other evidence that helped to prove the defendant's innocence. This included the victim failing to identify the defendant as the attacker and the defendant having an alibi for the night in question. Let us call these two pieces of evidence D_1 and D_2. Unlike the DNA evidence, the witnesses do not explicitly mention what the relevant probabilities are for this evidence, so it is up to the jurors to make reasonable estimates for these quantities. p(D_1|\mathcal{H}_1) is the probability the victim fails to identify the defendant as the attacker given the defendant's guilt. Set this to be around 10%, though police departments may actually have statistics on this rate. On the other hand, p(D_1|\mathcal{H}_0), the probability the victim fails to identify the defendant as the attacker given the defendant is not guilty is high, say around 90%. Thus, the Bayes factor considering the victim's failed identification of the defendant is about 1 in 10. Note that even if these numbers change by 10% this factor doesn't change in order of magnitude, so as long as a reasonable estimate is made for this factor, it doesn't really matter what the actual value is. The alibi is less convincing. Though the defendant's girlfriend testified, the defendant and the girlfriend could have confirmed their story with each other. Thus, I estimate the Bayes factor for the alibi p(D_2|\mathcal{H}_1)/p(D_2|\mathcal{H}_0) to be about 1 in 2.

Since all these pieces of evidence are independent, p(D_0,D_1,D_2|\mathcal{H})=p(D_0|\mathcal{H})p(D_1|\mathcal{H})p(D_2|\mathcal{H}). Thus, the Bayes factor for all evidence is between 100,000 and 10,000,000. Now, multiplying by the prior probability, this gives a posterior probability ratio, F, between 0.5 and 50. With the new evidence taken into account, there is no longer strong evidence that the defendant is guilty even in the best case scenario for the prosecution. Convicting someone with a posterior probability ratio of 50 would falsely convict people 2% of the time, which seems like an unacceptable rate if one is taking the notion of innocent until proven guilty seriously. Note that as long as the order of magnitude of each of the Bayes factor estimates doesn't change, the final result will also not change by more than 1 order of magnitude, so the outcome is fairly robust.

While this line of logic was presented to the jurors during the trial, the jurors still found the defendant guilty. The judge took objection to coming out with a definite number for the odds of guilt when the assumptions going into it are uncertain, though as argued before, for any reasonable choice, these numbers cannot change too much [2]. It seems that without formal training in statistics, it was difficult to accept these rules as "objective," even though this is a provably, well-defined, mathematical way to arrive at a decision. If common sense and the rules of logic and probability are really what jurors are considering to reach their decision, this has to be the outcome that they reach. [3] argues that to not believe the outcome of a Bayesian argument like this would be akin to not believing the result of a long division calculation done on a calculator.

The most common objection to Bayesian reasoning (though apparently not the one the judges in R v. Adams had) is that the choice of prior can be somewhat arbitrary. In the example above, in the estimation of people that could have committed the crime, I could take people who were on the same block, the same neighborhood, the same city, or maybe even the same state. Each of these would certainly give different answers, so care must be taken to choose appropriate values for the prior. This doesn't make the method wrong or unobjective. It is just that the method cannot start with no absolutely no assumptions. Given a basic assumption, though, it provides a systematic way to see how the basic assumption changes when all the evidence is considered.

This problem seems to stem from a misunderstanding of basic statistics by the jury, attorneys, and judges. Another example of this is the U.K. Judge Edwards-Stuart who claimed that putting a probability on an event that has already happened is "pseudo-mathematics" [4]. This just shows the judge's ignorance, as this is precisely the type of problem Bayesian inference can explain. It is a shame that in the U.K. Bayesian reasoning is actively discouraged due to R v. Adams, as this is the only rigorous way to deal with these types of problems. I wasn't able to find any specific examples in the U.S., but I assume the "fear" of Bayesian statistics in courts here is similar to the case in the U.K.

1. Jeffreys, H. 1998. The Theory of Probability.
2. Lynch, M. and McNally, M., 2003. “Science,” “common sense,” and DNA evidence: a legal controversy about the public understanding of science. Public Understanding of Science, 12-1, 83-103.
3. Fenton, N. and Neil, M., 2011. Avoiding Probabilistic Reasoning Fallacies in Legal Practice using Bayesian Networks. Austl. J. Leg. Phil., 36, 114-151.
4. Nulty and Others v. Milton Keynes Borough Council, 2013. EWCA Civ 15.