Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year and generally appreciate it, I do find it ridiculous how unreliable its timing can be. (This is mostly because Amtrak does not own the tracks it uses, but I will not get into that here.) Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and often took the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time the train arrived in Durham on time, and the delays were often over an hour.

This brings me to the point of this post, which is to answer the question: when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available, so past information is all I have to go on. Certainly, if all the trains were actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would cause me to miss roughly that fraction of trains.

Fig. 1: Delay of arrival of the 79 train to the Durham station for each day since 7/4/2006 (with cancelled trains omitted). Note that in the first year and a half of this data, there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all the data available for the 79 train arriving at the Durham station, and I've plotted the delays over this period in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" in which I sample the distribution of arrival times at the Durham station. The particular train I take is just another experiment, and I expect it to follow the distribution of past arrivals. Thus, the probability of missing the train if I arrive \tau minutes after the scheduled time is

 p(\tau) = \frac{N(t<\tau)}{N(t\geq 0)}.

Here, N(t<\tau) counts the number of arrivals in the available data whose delay t is less than the specified \tau, and N(t\geq 0) counts all arrivals. The question, then, is how much of the data to include in these counts. To test this, I considered a half year's worth of data as a test set. Then, I figured out how much of the previous data I should use as my training set to most accurately capture the delays in the test set. I found that using the year of data prior to the prediction time worked best. The method is not perfect; the percentage of missed trains predicted using the training set is about 10% off from the number in the test set, as there are year-to-year variations in the data.

A plot of p(\tau) using all the available data and using only the last year of data is shown in Fig. 2. Using only the last year to build the model, one can show up about 20 minutes after the scheduled time and still have an 80% chance of making the train. This also confirms Amtrak's estimate that their trains are on time 20-30% of the time. Even showing up an hour after the scheduled time, though, one still has a 36% chance of making the train!
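As a sketch, the empirical estimate p(\tau) can be computed directly from a list of past delays (the delay values below are invented for illustration):

```python
import numpy as np

def miss_probability(delays, tau):
    """Fraction of past trains that arrived before tau minutes after
    the scheduled time, i.e. N(t < tau) / N(t >= 0)."""
    d = np.asarray(delays, dtype=float)
    return np.sum(d < tau) / np.sum(d >= 0)

# Hypothetical delays (in minutes) for a year of arrivals
past_delays = [0, 5, 12, 25, 33, 41, 48, 55, 62, 75]
print(miss_probability(past_delays, 20))  # 3 of 10 trains arrived within 20 minutes -> 0.3
```

Evaluating this over a grid of \tau values gives a curve like the ones in Fig. 2.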

Fig. 2: p(\tau) determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, \delta, given the data of past arrivals, \mathcal{D}. I will call this p(\delta|\mathcal{D}). Suppose I have a model, characterized by a set of n unknown parameters \vec{a} that describes the probability of delay. I will assume all the important information that can be extracted from \mathcal{D} is contained in \vec{a}. Then, I can decompose the delay distribution as

 p(\delta|\mathcal{D}) = \int d^n \vec{a} \;\; p(\delta|\vec{a}) p(\vec{a}|\mathcal{D}).

Using Bayes theorem, p(\vec{a}|\mathcal{D}) can then be expressed as

 p(\vec{a}|\mathcal{D}) = \frac{p(\mathcal{D}|\vec{a})\pi(\vec{a})}{p(\mathcal{D})}.

Here, p(\mathcal{D}|\vec{a}) is the likelihood function (the model evaluated at all of the data points), \pi(\vec{a}) is the prior on the model parameters, and p(\mathcal{D}) is the evidence, which serves as a normalization factor. I use non-informative priors for \pi(\vec{a}).

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model \mathcal{M}_1 is f times greater than the evidence for a model \mathcal{M}_2 where

 f = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_2)}.

As the models are assumed to depend on the parameters \vec{a}, the evidence for a model \mathcal{M} is (note that a method without an explicit functional form, such as a machine learning method, could still be used if p(\mathcal{D}|\mathcal{M}) could be estimated another way)

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) = \int d^n \vec{a} \;\; \prod_{i=1}^N p(\delta_i|\vec{a})\pi(\vec{a}|\mathcal{M}).

Here, \delta_i are all of the delays contained in \mathcal{D}. This integral becomes difficult for large n (even n=3 is getting annoying). To make it more tractable, let l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})), and let \vec{a}^* be the value of the fit parameters that maximize l(\vec{a}). Expanding as a Taylor series gives

 p(\mathcal{D}|\vec{a}) = e^{l(\vec{a})} \approx e^{l(\vec{a}^*)}e^{\frac{1}{2}(\vec{a}-\vec{a}^*)^T H (\vec{a}-\vec{a}^*)},

where H is the matrix of second derivatives of l(\vec{a}) evaluated at \vec{a}^*. The integral can be evaluated using the Laplace approximation, giving

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) \approx e^{l(\vec{a}^*)} \sqrt{\frac{(2\pi)^n}{\det(-H)}}\pi(\vec{a}^*|\mathcal{M}),

which can now be evaluated by finding the \vec{a}^* that maximizes p(\mathcal{D}|\vec{a}). (Proper priors must be chosen for \pi(\vec{a}^*|\mathcal{M}) since the prior has to be evaluated at \vec{a}^*; I will not dwell on this point here.) I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found that under this metric the Gamma/Gompertz distribution described the data best. Using it, I explicitly calculate p(\delta|\mathcal{D}), again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.
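As a concrete sketch of the Laplace-approximation evidence, consider the simplest of the candidate models, the exponential distribution, for which the maximum of l and the Hessian can be written down analytically (the flat prior density used here is an arbitrary, hypothetical choice):

```python
import numpy as np

def log_evidence_exponential(delays, prior_density=1.0 / 120.0):
    """Laplace-approximation log evidence for the exponential model
    p(delta | lam) = lam * exp(-lam * delta).

    For this model the pieces are analytic:
      lam* = N / sum(delta),  l(lam*) = N ln(lam*) - N,
      -H = N / lam*^2.
    `prior_density` is an assumed flat prior on lam (hypothetical).
    """
    d = np.asarray(delays, dtype=float)
    n = len(d)
    lam = n / d.sum()                          # MLE, maximizes l(lam)
    l_max = n * np.log(lam) - lam * d.sum()    # log-likelihood at the maximum
    neg_hess = n / lam**2                      # -d^2 l / d lam^2 at lam*
    return l_max + 0.5 * np.log(2 * np.pi / neg_hess) + np.log(prior_density)
```

Computing this number for each candidate model and taking differences of the logs gives the log of the Bayes factor f.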

While this section got a bit technical, it confirms the results of the earlier, simpler analysis. In particular, this approach predicts that one should show up about 17 minutes after the scheduled arrival time to have an 80% chance of catching the train, and that one still has a 30% chance of catching the train when showing up an hour late to the Durham station.

Fig. 3: p(\delta|\mathcal{D}) calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.


In the 5 days since I downloaded the data, the delays of the 79 train have been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it suggests that the prediction that one would be fine 80% of the time showing up 20 minutes late to the Durham station isn't such a bad guideline.

Of course, both of the models described here are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainty is that all days are not equivalent: certain days are more likely to have delays than others. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the models; a true model of delays should either include them or characterize the effect of such systematic fluctuations.

Scripts for this post are here.

Modeling Seating Using Statistical Mechanics

When I ride buses and trains, I always find it fascinating that some people stand before all of the seats are taken (I sometimes follow this pattern as well). I figured this could be modeled as a finite-temperature effect in a statistical mechanics system. This seemed appropriate, as the process of seating reminded me of the adsorption of molecules onto a surface. This led me to [1], where free-for-all seating on an airplane is modeled. Here, I will talk about bus seating in a similar language.

The first assumption I make is that people are ideal non-interacting fermions. This means that two people cannot sit in the same seat or stand in the same space, and that people have no free will and simply follow the laws of physics. An interaction could be something like a patron avoiding sitting next to someone who has not showered, or a group of friends wanting to sit together. These clearly happen, but I assume such small effects are negligible in the collective behavior of everyone on the bus. In fact, interactions would not be too hard to incorporate into the model (though I have no idea how to incorporate free will). These may seem like bad assumptions, but all I really want to know is how people distribute themselves in a bus, not whether a particular person will sit somewhere.

Simple bus model
Fig. 1: The bus model used. The blue are seats and orange are standing positions. I assume the maximum capacity of the bus is 64, which roughly agrees with some of the information I gathered from transit sites [2]. The seating arrangement is a typical one I have seen on a bus.

I will split the bus up into a lattice, with different spacings for seats and standing spots (seats take up more space). This is shown in the figure above. Now, assume the total capacity of the bus is N people, that there are n people on the bus, and that these numbers are fixed (the bus is currently between stops). Since I am modeling the people as a fermion gas, the occupancy (the expected number of people) of a particular seat or standing spot (labeled by index i) with energy \epsilon_i is given by:

n_i = \frac{1}{e^{(\epsilon_i-\mu)/T}+1}.

Note that in the statistical mechanics result, the temperature, T, is usually multiplied by Boltzmann's constant. Since this is not a true statistical mechanics system, Boltzmann's constant is not applicable, so I will assume temperature is measured in units where Boltzmann's constant is 1. Note that the temperature here has no relation to how hot it is inside or outside the bus. In fact, the temperature of the system may be lower if it is hotter outside, since this might mean more people feel tired and would like to sit down. The temperature characterizes how likely it is that a person will choose a "non-optimal" seat. If the temperature were 0, the seats would all be filled in order of lowest energy. A finite (nonzero) temperature will allow for someone to choose to stand instead of sit, for example.

The energy, \epsilon_i, is roughly how undesirable or uncomfortable a seat is. Thus, a seat will have a lower energy whereas the standing positions have a higher energy.

\mu is the chemical potential, which is roughly the energy of the seat or space that is most likely to be occupied next. It can be solved for from the relation:

n = \sum_{i=1}^N \frac{1}{e^{(\epsilon_i-\mu)/T}+1}.

This can easily be solved numerically. I used the secant method to do it.
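A minimal sketch of that solve, using the occupancy formula above and a hand-rolled secant iteration (the spot counts and parameter values below are made up for illustration):

```python
import numpy as np

def occupancy(energies, mu, T):
    """Fermi-Dirac occupancy of each spot, in units where k_B = 1."""
    return 1.0 / (np.exp((energies - mu) / T) + 1.0)

def solve_mu(energies, n, T, mu0=-1.0, mu1=1.0, tol=1e-10, max_iter=100):
    """Secant iteration for mu such that the occupancies sum to n riders."""
    f = lambda mu: occupancy(energies, mu, T).sum() - n
    f0, f1 = f(mu0), f(mu1)
    for _ in range(max_iter):
        mu2 = mu1 - f1 * (mu1 - mu0) / (f1 - f0)
        if abs(mu2 - mu1) < tol:
            return mu2
        mu0, f0 = mu1, f1
        mu1, f1 = mu2, f(mu2)
    return mu1

# Hypothetical bus: 24 seats at energy -alpha and 40 standing spots at 0
alpha, T = 2.0, 1.0
energies = np.array([-alpha] * 24 + [0.0] * 40)
mu = solve_mu(energies, n=35, T=T)
```

Since the total occupancy is monotonic in \mu, the iteration converges quickly from almost any starting bracket.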

I will use a simple model for the energies. I will assume that a sitting spot has some negative energy -\alpha. For the purposes of this study, I take \alpha to be the same for every seat, though it would be trivial to let it differ from seat to seat. I will assume that a standing spot has zero energy. (Note that seats could instead have zero energy and standing an energy \alpha, and the results would be unchanged.) On buses I have also noticed an aversion to sitting behind the back door. To model this, I add an energy \beta to all seats in the back of the bus. This could easily be generalized to an aversion to being far from a door, but I will keep the model simple.

If the temperature is too high (relative to \alpha and \beta), the seats and standing spots look equivalent, and all places have a roughly equal probability of being occupied. This is not what I observe, so I assume the system is not in this regime. Further, if the temperature is too low (again relative to \alpha and \beta), the seats will all fill before anyone stands. Again, this is not what I observe, so I expect the temperature is not in this regime either. Thus, I look at cases where \alpha and \beta are within an order of magnitude or two of the temperature. Empirically, I found that \alpha/T should be between 0.5 and 20. I also assume \beta is less than \alpha, as sitting in the back should be more desirable than standing (which, you will recall, has zero energy).

Seating configuration for various values of parameters
Fig. 2: Typical bus configurations for various values of \alpha and \beta for n=35. The people are black dots. Again, blue are seats and orange are standing positions. Note there is no absolute configuration of people since statistical mechanics is probabilistic in nature.

The figure above shows representative arrangements of people for various values of \alpha/T and \beta/T. Multiple values of these parameters can produce similar results. It seems that \beta/T should be within a factor of 2 of \alpha/T to produce "realistic" results, though this could be made more rigorous with real observations.
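One way to generate configurations like those in Fig. 2 is to fix \mu so the occupancies sum to n and then fill each spot independently with probability n_i (so the rider count is only fixed on average). The 12/12/40 layout split and the parameter values below are invented for illustration:

```python
import numpy as np

# Hypothetical 64-spot bus: 12 front seats, 12 back seats (penalized
# by beta), and 40 standing spots; units where k_B = 1
alpha, beta, T, n = 2.0, 1.0, 1.0, 35
energies = np.concatenate([np.full(12, -alpha),
                           np.full(12, -alpha + beta),
                           np.zeros(40)])

def occupancy(mu):
    """Fermi-Dirac occupancy of each spot."""
    return 1.0 / (np.exp((energies - mu) / T) + 1.0)

# Bisection for mu: total occupancy is monotonically increasing in mu
lo, hi = -50.0, 50.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if occupancy(mid).sum() < n:
        lo = mid
    else:
        hi = mid
mu = 0.5 * (lo + hi)

# Draw one representative configuration: each spot occupied with
# probability n_i, independently of the others
rng = np.random.default_rng(0)
config = rng.random(energies.size) < occupancy(mu)
```

Repeating the draw with different seeds gives an ensemble of typical configurations, from which the figures above could be reproduced.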

Here, I just wanted to illustrate how physics can be used to model people, so I have only considered a very basic model. One should probably not draw many conclusions from it, but it can be extended. Some seats (and standing spots) are certainly more desirable than others, so the energies could be altered to reflect this instead of being the same for all seats. Another interesting extension would be to model how the bus fills instead of only looking at the final state. With enough data, the arrangement of seats on a bus could probably be optimized using a model like this so that everyone is most comfortable.

Of course, this kind of reasoning can apply to other scenarios as well. I was once in an airport waiting room and noticed many people standing even though seats were available. Similar reasoning might reveal better ways to arrange the chairs. I have also noticed that in lecture halls people tend to sit at the edges of rows, making it hard to get to the middle seats. This could be modeled similarly, and perhaps lecture hall designs could be optimized. I hope to write more blog entries along these lines at some point.

If you'd like to play around with this yourself, you can modify the python code I wrote. Let me know if you discover anything interesting!

1. Steffen, J. H., 2008b. A statistical mechanics model for free-for-all airplane passenger boarding. American Journal of Physics 76, 1114–1119.