Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I only use Amtrak a handful of times a year and generally appreciate it, I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks it uses, but I won't get into that here.) Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and often took the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time the train arrived in Durham on time, and the delays were often over an hour.

This brings me to the point of this post, which is to answer the question: when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available, so past arrivals are all I have to go on. Certainly, if every train were actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would make you miss roughly that fraction of trains.




Fig. 1: Delay of the 79 train arriving at the Durham station for each day since 7/4/2006 (cancelled trains omitted). Note that in the first year and a half of this data almost no trains arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all of the available data for the 79 train arriving at the Durham station, and I've plotted the arrival delays over this period in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" in which I sample the distribution of arrival times at the Durham station. The particular train I take is just another such experiment, and I expect it to follow the observed distribution of arrivals. Thus, the probability of missing the train if I arrive \tau minutes after the scheduled time is

 p(\tau) = \frac{N(t<\tau)}{N(t\geq 0)}.

where N(t<\tau) counts the number of arrivals in the available data whose delay t is less than the specified \tau. The question, then, is how much of the data to include in these counts. To test this, I set aside a half year's worth of data as a test set, and then figured out how much of the preceding data to use as a training set to most accurately capture the delays in the test set. I found that using the year of data prior to the prediction time worked best. The method is not perfect; the fraction of missed trains predicted from the training set is about 10% off from the fraction in the test set, as there are year-to-year variations in the data.
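As a concrete illustration, here is a minimal sketch of this counting estimate in Python. This is not the actual script behind this post; the file name and variable names are placeholders.

```python
import numpy as np

# Hypothetical input: one arrival delay (in minutes) per non-cancelled train,
# restricted to the one-year training window.
delays = np.loadtxt("durham_79_delays_last_year.txt")

def p_miss(tau, delays):
    """Estimated probability of missing the train when arriving tau minutes
    after the scheduled time: N(t < tau) / N(t >= 0)."""
    delays = np.asarray(delays)
    return np.sum(delays < tau) / np.sum(delays >= 0)

# Latest arrival time (in whole minutes) that still catches at least 80% of trains.
taus = np.arange(0, 121)
p = np.array([p_miss(t, delays) for t in taus])
latest_ok = taus[p <= 0.20].max() if np.any(p <= 0.20) else 0
print(f"Show up no more than {latest_ok} minutes late for an ~80% chance of making the train.")
```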

A plot of p(\tau) using all the available data and using only the last year of data is shown in Fig. 2. Using only the last year to build the model, one can show up about 20 minutes after the scheduled time and still have an 80% chance of making the train. This also agrees with Amtrak's estimate that the Carolinian is on time 20-30% of the time. Even someone who shows up an hour after the scheduled time still has a 36% chance of making the train!




Fig. 2: p(\tau) determined using all of the available data (blue) and only the last year of data (gold). For delays longer than 60 minutes the two curves are similar, indicating that for long waits either prediction method gives similar results. It appears that over the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies between the curves at shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay \delta given the data of past arrivals \mathcal{D}, which I will call p(\delta|\mathcal{D}). Suppose I have a model, characterized by a set of n unknown parameters \vec{a}, that describes the probability of delay. I will assume that all the important information that can be extracted from \mathcal{D} is contained in \vec{a}. Then I can decompose the delay distribution as

 p(\delta|\mathcal{D}) = \int d^n \vec{a} \;\; p(\delta|\vec{a}) p(\vec{a}|\mathcal{D}).

Using Bayes' theorem, p(\vec{a}|\mathcal{D}) can then be expressed as

 p(\vec{a}|\mathcal{D}) = \frac{p(\mathcal{D}|\vec{a})\pi(\vec{a})}{p(\mathcal{D})}.

Here, p(\mathcal{D}|\vec{a}) is the likelihood function (the model evaluated at all of the data points), \pi(\vec{a}) is the prior on the model parameters, and p(\mathcal{D}) is the evidence, which serves as a normalization factor. I use non-informative priors for \pi(\vec{a}).

The question, then, is what the model should be. A priori, I have no reason to favor one model over another, so I decided to try several and see which described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model \mathcal{M}_1 is f times greater than the evidence for a model \mathcal{M}_2, where

 f = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_2)}.

Since the models are assumed to depend on the parameters \vec{a} (note that a method without an explicit functional form, such as a machine learning method, could still be used if p(\mathcal{D}|\mathcal{M}) could be estimated some other way), the evidence for a model is

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) = \int d^n \vec{a} \;\; \prod_{i=1}^N p(\delta_i|\vec{a})\pi(\vec{a}|\mathcal{M}).
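To make this likelihood term concrete, here is a small sketch of l(\vec{a}) = \ln p(\mathcal{D}|\vec{a}) for one simple candidate, a one-parameter exponential model. The function name and the restriction to non-negative delays are my own choices for illustration, not taken from the scripts behind this post.

```python
import numpy as np

def log_likelihood_exp(rate, delays):
    """l(a) = sum_i ln p(delta_i | a) for the exponential model
    p(delta | a) = a * exp(-a * delta), defined for delta >= 0."""
    delays = np.asarray(delays, dtype=float)
    if rate <= 0:
        return -np.inf  # outside the allowed parameter range
    return np.sum(np.log(rate) - rate * delays)
```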

Here, \delta_i are all of the delays contained in \mathcal{D}. This integral becomes difficult for large n (even n=3 is getting annoying). To make it more tractable, let l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})), and let \vec{a}^* be the value of the fit parameters that maximize l(\vec{a}). Expanding as a Taylor series gives

 p(\mathcal{D}|\vec{a}) = e^{l(\vec{a})} \approx e^{l(\vec{a}^*)}e^{\frac{1}{2}(\vec{a}-\vec{a}^*)^T H (\vec{a}-\vec{a}^*)},

where H is the matrix of second derivatives of l(\vec{a}) evaluated at \vec{a}^*. The integral can be evaluated using the Laplace approximation, giving

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) \approx e^{l(\vec{a}^*)} \sqrt{\frac{(2\pi)^n}{\det(-H)}}\pi(\vec{a}^*|\mathcal{M}),

which can now be evaluated by finding the \vec{a}^* that maximizes p(\mathcal{D}|\vec{a}). (Proper priors must be chosen for \pi(\vec{a}^*|\mathcal{M}), since the prior has to be evaluated at \vec{a}^*; I will not dwell on this point here.) I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found that the Gamma/Gompertz distribution described the data best under this metric. Using it, I explicitly calculate p(\delta|\mathcal{D}), again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.
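For reference, below is a rough sketch of how this Laplace-approximated evidence could be evaluated numerically. This is not the post's actual script: log_lik, the starting guess a0, and log_prior_at_max are placeholders to be supplied for each candidate model, and the Hessian here is computed by finite differences.

```python
import numpy as np
from scipy.optimize import minimize

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of the scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

def log_evidence_laplace(log_lik, a0, delays, log_prior_at_max=0.0):
    """ln p(D|M) ~ l(a*) + (n/2) ln(2 pi) - (1/2) ln det(-H) + ln pi(a*|M)."""
    l = lambda a: log_lik(a, delays)
    result = minimize(lambda a: -l(a), np.atleast_1d(a0), method="Nelder-Mead")
    a_star = result.x                     # the parameters a* that maximize l
    H = numerical_hessian(l, a_star)      # Hessian of l at a* (negative definite)
    n = len(a_star)
    _, logdet_negH = np.linalg.slogdet(-H)
    return l(a_star) + 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet_negH + log_prior_at_max

# The Bayes factor between two candidate models is then, e.g.,
# f = np.exp(log_evidence_laplace(log_lik_1, a0_1, delays)
#            - log_evidence_laplace(log_lik_2, a0_2, delays))
```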

While this section got a bit technical, it confirms the results of the earlier, simpler analysis. In particular, it predicts that one should show up about 17 minutes after the scheduled arrival time to catch 80% of trains, and that one still has a 30% chance of catching the train after showing up an hour late to the Durham station.




Fig. 3: p(\delta|\mathcal{D}) calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.

Conclusions

Since I downloaded the data, 5 days have passed, and in that time the delays of the 79 train have been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it seems the prediction that one would be fine showing up 20 minutes late to the Durham station 80% of the time isn't such a bad guideline.

Of course, both of the models described here are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainty is that all days are not equivalent: certain days are more likely to have a delay than others. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the models; a true model of delays should either include them or characterize the effect of such systematic fluctuations.

Scripts for this post are here.