
Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year and generally appreciate it, I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks it uses, but I won't get into that here.) Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and often took the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time the train arrived in Durham on time, and the delays were often over an hour.

This brings me to the point of this post: when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available, so past information is all I have to go on. Certainly, if all the trains were actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up an hour late would make you miss roughly that fraction of trains.




Fig. 1: Arrival delay of the 79 train at the Durham station for each day since 7/4/2006 (cancelled trains omitted). Note that in the first year and a half of this data there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I pulled all the available data for the 79 train arriving at the Durham station and plotted the delays in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" where I sample the distribution of arrival times to the Durham station. The particular train I take is just another experiment, and I would expect it to follow the available distribution of arrivals. Thus, the probability of me missing the train if I arrive \tau minutes after the scheduled time is

 p(\tau) = \frac{N(t<\tau)}{N(t\geq 0)}.

where N(t<\tau) counts the arrivals in the data with delay t less than \tau, and N(t\geq 0) is the total number of recorded arrivals. The question, then, is how much past data to include in these counts. To test this, I set aside a half year of data as a test set and checked how much of the preceding data I should use as a training set to most accurately capture the delays in the test set. I found that using the year of data prior to the prediction time worked best. The method is not perfect: the fraction of missed trains predicted from the training set is about 10% off from the fraction in the test set, as there are year-to-year variations in the data.
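As a minimal sketch of this counting in Python (the delays array and the p_miss helper below are hypothetical placeholders; the real input would be the last year of recorded delays for the 79 train, with cancellations removed):

```python
import numpy as np

# Hypothetical placeholder delays in minutes; in practice this would be the last
# year of arrival data for the 79 train at Durham, with cancelled trains removed.
delays = np.array([12.0, 25.0, 33.0, 41.0, 47.0, 55.0, 68.0, 95.0])

def p_miss(tau, delays):
    """Empirical probability of missing the train when arriving tau minutes after
    the scheduled time: the fraction of past trains that arrived earlier than that."""
    delays = np.asarray(delays)
    return np.count_nonzero(delays < tau) / delays.size

# Latest arrival time (in whole minutes) that still gives at least an 80% chance
# of catching the train, i.e. a miss probability of at most 20%.
latest = max(tau for tau in range(241) if p_miss(tau, delays) <= 0.2)
print(p_miss(20, delays), latest)
```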

A plot of p(\tau) using all the available data and using only the last year of data is shown in Fig. 2. Using only the last year to build the model, one can show up about 20 minutes after the scheduled time and still have an 80% chance of making the train. This also confirms Amtrak's estimate that the train is on time 20-30% of the time. Even if one shows up an hour after the scheduled time, though, one still has a 36% chance of making the train!




Fig. 2: p(\tau) determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, \delta, given the data of past arrivals, \mathcal{D}; I will call this p(\delta|\mathcal{D}). Suppose I have a model, characterized by a set of n unknown parameters \vec{a}, that describes the probability of a delay, and assume all the important information that can be extracted from \mathcal{D} is contained in \vec{a}. Then I can decompose the delay distribution as

 p(\delta|\mathcal{D}) = \int d^n \vec{a} \;\; p(\delta|\vec{a}) p(\vec{a}|\mathcal{D}).

Using Bayes theorem, p(\vec{a}|\mathcal{D}) can then be expressed as

 p(\vec{a}|\mathcal{D}) = \frac{p(\mathcal{D}|\vec{a})\pi(\vec{a})}{p(\mathcal{D})}.

Here, p(\mathcal{D}|\vec{a}) is the likelihood function (the model evaluated at all of the data points), \pi(\vec{a}) is the prior on the model parameters, and p(\mathcal{D}) is the evidence, which serves as a normalization factor. I use non-informative priors for \pi(\vec{a}).
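To make the decomposition concrete, here is a sketch for the simplest candidate considered below, a one-parameter exponential delay model p(\delta|a) = a e^{-a\delta}, with the posterior over a evaluated on a grid rather than with the Laplace approximation used later; the delays array is again a hypothetical placeholder:

```python
import numpy as np

# Hypothetical placeholder delays in minutes (the real analysis uses the last
# year of data for the 79 train at Durham).
delays = np.array([12.0, 25.0, 33.0, 41.0, 47.0, 55.0, 68.0, 95.0])

# Grid over the rate parameter a of the exponential model p(delta|a) = a*exp(-a*delta).
a_grid = np.linspace(1e-4, 0.5, 2000)
da = a_grid[1] - a_grid[0]

# Log-likelihood at each grid point, with a flat (non-informative) prior on a.
log_like = np.array([np.sum(np.log(a) - a * delays) for a in a_grid])
posterior = np.exp(log_like - log_like.max())  # unnormalized p(a|D)
posterior /= posterior.sum() * da              # normalize on the grid; p(D) drops out

def p_delta(delta):
    """Posterior-predictive density p(delta|D) = int da p(delta|a) p(a|D)."""
    return np.sum(a_grid * np.exp(-a_grid * delta) * posterior) * da

# Probability of missing the train when showing up tau minutes late,
# i.e. the integral of p(delta|D) from 0 to tau.
tau = 20
d_grid = np.linspace(0.0, tau, 201)
miss_prob = np.sum([p_delta(d) for d in d_grid]) * (d_grid[1] - d_grid[0])
print(miss_prob)
```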

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model \mathcal{M}_1 is f times greater than the evidence for a model \mathcal{M}_2 where

 f = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_2)}.

As the models are assumed to depend on the parameters \vec{a} (note that a method without an explicit functional form, such as a machine learning method, could still be used if p(\mathcal{D}|\mathcal{M}) could be estimated another way),

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) = \int d^n \vec{a} \;\; \prod_{i=1}^N p(\delta_i|\vec{a})\pi(\vec{a}|\mathcal{M}).

Here, \delta_i are all of the delays contained in \mathcal{D}. This integral becomes difficult for large n (even n=3 is getting annoying). To make it more tractable, let l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})), and let \vec{a}^* be the value of the fit parameters that maximize l(\vec{a}). Expanding as a Taylor series gives

 p(\mathcal{D}|\vec{a}) = e^{l(\vec{a})} \approx e^{l(\vec{a}^*)}e^{\frac{1}{2}(\vec{a}-\vec{a}^*)^T H (\vec{a}-\vec{a}^*)},

where H is the matrix of second derivatives of l(\vec{a}) evaluated at \vec{a}^*. The integral can be evaluated using the Laplace approximation, giving

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) \approx e^{l(\vec{a}^*)} \sqrt{\frac{(2\pi)^n}{\det(-H)}}\pi(\vec{a}^*|\mathcal{M}),

which can now be evaluated by finding the \vec{a}^* that maximizes p(\mathcal{D}|\vec{a}). (Proper priors must be chosen for \pi(\vec{a}^*|\mathcal{M}), since the prior has to be evaluated at \vec{a}^*; I will gloss over this point here.) I tested the exponential, Gompertz, and Gamma/Gompertz distributions and found that, under this metric, the Gamma/Gompertz distribution described the data best. Using it, I explicitly calculated p(\delta|\mathcal{D}), again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.
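Here is a sketch of this recipe in Python, comparing two of the candidate models (the exponential and Gompertz distributions; the Gamma/Gompertz case works the same way). The delays array is a hypothetical placeholder, the Hessian is taken by finite differences, and the prior factor \pi(\vec{a}^*|\mathcal{M}) is simply set to 1, which is exactly the sloppiness about priors acknowledged above:

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical placeholder delays in minutes; the real fits use a year of data.
delays = np.array([12.0, 25.0, 33.0, 41.0, 47.0, 55.0, 68.0, 95.0])

# Negative log-likelihoods, -l(a), for two of the candidate delay models.
def nll_expon(params):
    (scale,) = params
    if scale <= 0:
        return np.inf
    return -np.sum(stats.expon.logpdf(delays, scale=scale))

def nll_gompertz(params):
    c, scale = params
    if c <= 0 or scale <= 0:
        return np.inf
    return -np.sum(stats.gompertz.logpdf(delays, c, scale=scale))

def hessian(f, x, eps=1e-3):
    """Central finite-difference Hessian of the scalar function f at the point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

def log_evidence(nll, x0):
    """Laplace approximation to ln p(D|M): l(a*) + 0.5*n*ln(2*pi) - 0.5*ln det(-H),
    with the prior factor pi(a*|M) set to 1 for this sketch."""
    res = optimize.minimize(nll, x0, method="Nelder-Mead")
    neg_H = hessian(nll, res.x)  # Hessian of -l(a), i.e. -H in the text
    n = len(res.x)
    return (-res.fun + 0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(neg_H)))

# ln of the Bayes factor f comparing the two models.
ln_f = log_evidence(nll_expon, [40.0]) - log_evidence(nll_gompertz, [0.5, 30.0])
print("ln(Bayes factor), exponential vs Gompertz:", ln_f)
```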

While this section got a bit technical, it confirms the results of the earlier, simpler analysis. In particular, it predicts that one should show up about 17 minutes after the scheduled arrival time to have an 80% chance of catching the train, and that one still has a 30% chance of catching the train when showing up an hour late to the Durham station.




Fig. 3: p(\delta|\mathcal{D}) calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.

Conclusions

In the 5 days since I downloaded the data, the delays of the 79 train have been 22, 40, 51, 45, and 97 minutes. It's a small sample, but the prediction that one would be fine 80% of the time showing up 20 minutes late to the Durham station doesn't seem like such a bad guideline.

Of course, both of the models described here are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainty is that all days are not equivalent: certain days are more likely to have a delay than others. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account here; a true model of delays should either include them or characterize the effect of such systematic fluctuations.

Hypothesis Testing in a Jury

I recently served on a jury and was quite surprised at how unobjective some of the other jurors were when thinking about the case. For our case it turned out not to matter, because the decision was obvious, but it got me thinking about the formal reasoning behind "beyond a reasonable doubt." This reasoning involves more statistics than physics, but considering I've been thinking about Bayesian analyses in my research recently, it's quite appropriate.

At the most basic level, a jury decision is a hypothesis test. I wish to distinguish between the hypotheses of not guilty (call it \mathcal{H}_0, since the defendant is innocent until proven guilty) and guilty (call it \mathcal{H}_1). In Bayesian statistics, the way to compare two hypotheses is by computing the ratio of posterior probabilities.

F=\frac{p(\mathcal{H}_1|D)}{p(\mathcal{H}_0|D)}.

Here, p(\mathcal{H}|D) is the probability of the hypothesis \mathcal{H} given the available data D (the evidence). These probabilities may not seem obvious to compute, and I'll discuss later how one might determine values for them. If F=2, then the hypothesis \mathcal{H}_1 is twice as likely as the hypothesis \mathcal{H}_0. Thus, if F \gg 1, the evidence for \mathcal{H}_1 is overwhelming. What this means in terms of "beyond a reasonable doubt" is debatable, but it is generally accepted that if F \gtrsim 100, there is strong evidence for \mathcal{H}_1 over \mathcal{H}_0 [1]. Similarly, if F\ll 1, then the evidence for \mathcal{H}_0 is overwhelming. Thus, the question of determining guilt or innocence is equivalent to calculating F.

p(\mathcal{H}|D) can be rewritten using Bayes theorem as

 p(\mathcal{H}|D)=\frac{p(D|\mathcal{H})p(\mathcal{H})}{p(D)}.

Here, p(D|\mathcal{H}) is the probability of the evidence given the assumption of guilt (or innocence), which is more tractable than p(\mathcal{H}|D) itself. (The prosecutor's fallacy can be thought of as confusing p(D|\mathcal{H}) with p(\mathcal{H}|D).) p(\mathcal{H}) is the prior, which quantifies how strongly one believes \mathcal{H} relative to other hypotheses. p(D) is a normalization factor that ensures the probabilities sum to 1; in the expression for F it cancels out, so there is no need to worry about this term. With this replacement, the ratio of posterior probabilities becomes

F = \frac{p(D|\mathcal{H}_1)}{p(D|\mathcal{H}_0)}\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}.

The first ratio is called the Bayes factor. The second ratio quantifies the relative prior belief in the two hypotheses; given no other information, it is the prior odds that the defendant is guilty. Suppose I accept that a crime was committed but that the identity of the criminal is in question. If the crime was committed by a single person, the prior odds are then roughly the inverse of the number of people who could have committed it.

Now I will consider the calculation of the Bayes factor for a real trial: R v. Adams, which set a precedent for banning explicit Bayesian reasoning in British courts in the context of DNA evidence. It was estimated during the trial that there were roughly 200,000 men in the age range 20-60 who could have committed the crime. Note that some extra assumptions about age and gender are being made here, so this is not exactly the ratio of prior beliefs. However, if I lifted these restrictions, the Bayes factor for the victim's misidentification of the defendant would change accordingly, so this is not a major concern.

First, consider the DNA evidence, which was the only piece of evidence incriminating the defendant. Call this evidence D_0. p(D_0|\mathcal{H}_1) is the probability of a positive DNA match under the assumption that the defendant is guilty. This is presumably extremely close to 1, or DNA evidence would not be considered good evidence in trials. p(D_0|\mathcal{H}_0) is the probability of a positive match if the defendant is not guilty. Taking into account the population of the U.K., this was estimated in the trial to be between 1 in 2 million and 1 in 200 million (though possibly as low as 1 in 200 since the defendant had a half-brother) [2]. Thus, the Bayes factor considering only the DNA evidence is between 2 million and 200 million. With the 1 in 200,000 prior odds, the posterior probability ratio F is between 10 and 1,000. Only the higher end of this range is overwhelming evidence, and since the jury is supposed to give the defendant the benefit of the doubt when the evidence is in conflict, it seems a "not guilty" verdict would have been appropriate.

Further, this ignores all of the other evidence that helped prove the defendant's innocence, which included the victim failing to identify the defendant as the attacker and the defendant having an alibi for the night in question. Call these two pieces of evidence D_1 and D_2. Unlike the DNA evidence, no explicit probabilities were given for this evidence at trial, so it is up to the jurors to make reasonable estimates of these quantities. p(D_1|\mathcal{H}_1) is the probability that the victim fails to identify the defendant as the attacker given that the defendant is guilty; I set this to around 10%, though police departments may actually keep statistics on this rate. On the other hand, p(D_1|\mathcal{H}_0), the probability that the victim fails to identify the defendant given that the defendant is not guilty, is high, say around 90%. Thus, the Bayes factor for the victim's failed identification is about 1 in 10. Note that even if these numbers change by 10%, this factor does not change in order of magnitude, so as long as a reasonable estimate is made, the exact value doesn't really matter. The alibi is less convincing: though the defendant's girlfriend testified, the defendant and the girlfriend could have coordinated their story. Thus, I estimate the Bayes factor for the alibi, p(D_2|\mathcal{H}_1)/p(D_2|\mathcal{H}_0), to be about 1 in 2.

Assuming these pieces of evidence are independent, p(D_0,D_1,D_2|\mathcal{H})=p(D_0|\mathcal{H})p(D_1|\mathcal{H})p(D_2|\mathcal{H}), so the Bayes factor for all the evidence is between 100,000 and 10,000,000. Multiplying by the prior odds gives a posterior probability ratio F between 0.5 and 50. With this new evidence taken into account, there is no longer strong evidence that the defendant is guilty, even in the best-case scenario for the prosecution. Convicting on a posterior probability ratio of 50 would falsely convict people about 2% of the time, which seems like an unacceptable rate if one takes the notion of innocent until proven guilty seriously. Note that as long as the order of magnitude of each Bayes factor estimate doesn't change, the final result will not change by more than an order of magnitude, so the outcome is fairly robust.
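The whole chain of reasoning fits in a few lines of Python; a sketch using the numbers quoted above:

```python
# Bayes factors p(D_i|H1)/p(D_i|H0) for each piece of evidence, with the
# estimates used in the text.
bf_dna_low, bf_dna_high = 2e6, 2e8  # inverse of the 1-in-2-million to 1-in-200-million false-match probability
bf_identification = 0.1             # ~0.10/0.90, rounded to 1 in 10 as in the text
bf_alibi = 0.5                      # alibi corroborated only by the girlfriend
prior_odds = 1 / 200_000            # roughly 200,000 possible perpetrators

for bf_dna in (bf_dna_low, bf_dna_high):
    F = bf_dna * bf_identification * bf_alibi * prior_odds
    print(F)  # 0.5 at the low end, 50 at the high end
```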

While this line of reasoning was presented to the jurors during the trial, they still found the defendant guilty. The judge objected to putting a definite number on the odds of guilt when the assumptions going into it are uncertain, though, as argued above, these numbers cannot change too much for any reasonable choice [2]. It seems that without formal training in statistics, it is difficult to accept these rules as "objective," even though this is a well-defined mathematical way to arrive at a decision. If common sense and the rules of logic and probability are really what jurors use to reach their decision, this has to be the outcome they reach. [3] argues that to disbelieve the outcome of a Bayesian argument like this would be akin to disbelieving the result of a long division done on a calculator.

The most common objection to Bayesian reasoning (though apparently not the one the judges in R v. Adams had) is that the choice of prior can be somewhat arbitrary. In the example above, in estimating the number of people who could have committed the crime, I could take people on the same block, in the same neighborhood, in the same city, or maybe even in the same state. Each of these would give a different answer, so care must be taken to choose an appropriate prior. This doesn't make the method wrong or unobjective; it just means the method cannot start with absolutely no assumptions. Given a basic assumption, though, it provides a systematic way to see how that assumption changes when all the evidence is considered.

This problem seems to stem from a misunderstanding of basic statistics by juries, attorneys, and judges. Another example of this is the U.K. judge Edwards-Stuart, who claimed that putting a probability on an event that has already happened is "pseudo-mathematics" [4]. This just shows the judge's ignorance, as this is precisely the type of problem Bayesian inference can handle. It is a shame that Bayesian reasoning is actively discouraged in the U.K. due to R v. Adams, as it is the only rigorous way to deal with these types of problems. I wasn't able to find any specific examples in the U.S., but I assume the "fear" of Bayesian statistics in courts here is similar to the case in the U.K.

References
1. Jeffreys, H., 1998. Theory of Probability. Oxford University Press.
2. Lynch, M. and McNally, M., 2003. "Science," "common sense," and DNA evidence: a legal controversy about the public understanding of science. Public Understanding of Science, 12(1), 83-103.
3. Fenton, N. and Neil, M., 2011. Avoiding Probabilistic Reasoning Fallacies in Legal Practice using Bayesian Networks. Austl. J. Leg. Phil., 36, 114-151.
4. Nulty and Others v. Milton Keynes Borough Council, 2013. EWCA Civ 15.