Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really I read almost every article written), and I think they do a great job of interpreting poll results. However, I do notice that the process in which they compile the results is tedious. Some of the tediousness, like collecting poll results, is inevitable. However, other work including figuring out which polls are reliable and how to interpret polls across different states seems like it might be automatable. I took a stab at this automation using data that fivethirtyeight compiled, which has compiled polling data for various state and national elections three weeks before the election date. This data was about 6200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but exactly what kind of moving average taken is up for debate.
2) How to decide which polls are good. Bad polls are a problem. If a pollster is partisan, there needs to be a way to take into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll if it disagrees with the common wisdom, this can introduce bias. There is also uncertainty about how uncertain voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue was tricky, and will not be covered here, but the first three issues are discussed in this post.

I tackle the problem as a time series prediction problem. That is, given a time series (when the polls happen and their results) I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to solve the problem. RNNs even handle different-length sequences quite naturally, which is a plus as this is awkward to encode into useful input for other types of models like a tree-based model.

I use the data prior to 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, instead of whole elections (or the whole presidential race, which is equally complex). For each poll, I compile a set of information about the polls:
• How each candidate in the race polled, in order of democrat, republican, and highest polling third party candidate, if available. The order is relevant as I account for the political preferences of polls.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• Of polls conducted before the poll, the percentile of the number of polls the pollster has conducted relative to all other pollsters. The intuition here is that agencies that do one poll are not very reliable, whereas agencies that have done many polls probably are.

All of this information is collected into a single vector, which I will call the poll information. I then take all polls of the same race previous to that poll and make an ordered list of the poll informations, which is the sequence that is the input to the neural network model.

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any sense of "year" as input to the neural network as I wish the model to extrapolate on this variable and hence I do not want the model to overfit to any trends there may be in this variable. I use three LSTM layers with 256 units followed by two fully connected layers with 512 neurons. The output layer is one neuron that predicts the final margin of the election. Mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initialize by transfer learning from an exponentially weighted moving average.

I use dropout at time of prediction as a way to get an estimate of the error in the output of the model. The range where 90% of predictions lie using different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on a training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 election). I find the percentile of the true election result within the Monte Carlo model predictions. Thus, a perfectly calibrated model would have a uniform distribution of the percentile of true election results within the Monte Carlo model predictions. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic with the uniform distribution. This turned out to be 40%, which was comforting as this is a typical choice for dropout at training.

A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data prior to 2016 (thus using both the training and validation set). I take the calibrated dropout rate and again get Monte Carlo samples for polls, using the newly trained model, on the test set. I then simply count the number of positive margin Monte Carlo outputs to get a probability that the election swings in favor of the first party.

Applying this to a race, I find that the uncertainty in ultimate election margin pretty large, around 15%. This isn't any larger than predictions with other aggregation methods, but this shows that it really is tough to accurately call close races just based on polls.

Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available that day, for each state I run 1000 predictions with different RNG seeds. For each of these 1000 predictions, I add up the electoral votes the candidate would win if they had the predicted margins. The probability of the candidate winning is then the percentage of these outcomes that is below 270.

Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republic victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.

The probability of Clinton winning the 2016 election predicted by the model as a function of days before the election.

Unless the 2016 election truly was a rare event, this shows that clearly, the model is incomplete. Relationships between how states vote compared to polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively, and in particular, better discern which pollsters are reliable. More features, such as whether the incumbent president is running or if it is an off-year election may also add more information in the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.

1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or larger than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the main contributor to the KS test statistic, so it is ignored here.

Who will win Top Chef Season 14?

Warning: Spoilers ahead if you have not seen the first two episodes of the new season

In the first episode of the season Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has won the first quickfire and gone on to win the whole thing (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9 when a whole team won the elimination challenge). This got me wondering if there were other predictors as to who would win Top Chef.

There's not too much data after the first elimination challenge, but I tried building a predictive model using the chef's gender, age, quickfire and elimination performance, and current residence (though I ultimately selected the most predictive features from the list). I used this data as features with a target variable of elimination number to build a gradient-boosted decision tree model to predict when the chefs this season would be eliminated. I validated the model with seasons 12 and 13 and then applied the model to season 14. I looked at the total distance between the predicted and actual placings of the contestants as the metric to optimize during validation. The model predicted both of these seasons correctly, but seasons 12 and 13 were two seasons where the winner of the first elimination challenge became top chef.

The most important features in predicting the winner were: elimination challenge 1 performance, season (catching general trends across seasons), gender, home state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Chicago. To try to visualize this better, I used these important features and performed a PCA to plot the data in two dimensions. This shows how data clusters, without any knowledge of the ultimate placement of the contestants.

A plot of the PCA components using the key identified features. The colors represent the ultimate position of the contestants. Blue represents more successful contestants where red represents less successful contestants. The $x$ direction corresponds mostly to first elimination success (with more successful contestants on the right) and the $y$ direction corresponds mostly to gender (with male on top). The smaller spreads correspond to the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is a certain deal-breaker in terms of winning the competition, but of course, winning the first challenge puts you in a better position.

My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.

Of course, there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about including but didn't was what the chefs actually cooked. I thought certain ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g. season 11 was probably the only time turtle was cooked in an elimination challenge). Obviously, with more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.

The code is available here.

I have been thinking a lot about teaching lately (maybe now that I will not be teaching anymore) and I hope to write a series of a few blog posts about it. My first post here will be on grade inflation, specifically whether curving is an effective way to combat it.

A popular method to combat grade inflation seems to be to impose a set curve for all classes. That is, for example, the top 25% of students get As, the next 35% get Bs and the bottom 40% gets Cs, Ds, and Fs (which is the guideline for my class). While this necessarily avoids the problem of too many people getting As, it can be a bit too rigid, which I will show below.

In the class I teach, there are ~350 students, who are spread among three lectures. I will investigate what effect the splitting of students into the lectures has on their grade. First, I will make an incredibly simple model where I assume there is a "true" ranking of all the students. That is, if all the students were actually in one big class, this would be the ordering of their grades in the course. I will assume that the assessments given to the students in the classes they end up in are completely fair. That is, if their "true" ranking is the highest of anyone in the class, they will get the highest grade in the class and if their "true" ranking is the second highest of anyone in the class they will get the second highest grade and so on. I then assign students randomly to three classes and see how much their percentile in the class fluctuates based on these random choices. This is shown below

The straight black line shows the percentile a student would have gotten had the students been in one large lecture. The black curve above and below it shows the 90% variability in percentile due to random assignment.

We see that even random assignment can cause significant fluctuations, and creates variability particularly for the students in the "middle of the pack." Most students apart from those at the top and bottom could have their letter grade change by a third of a letter grade just due to how the classes were chosen.

Further, this was just assuming the assignment was random. Often, the 8 am lecture has more freshman because they register last and lectures at better times are likely to fill up. There may also be a class that advanced students would sign up for that conflicts with one of the lecture times of my course. This would cause these advanced students to prefer taking the lectures at other times. These effects would only make the story worse from what's shown above.

We have also assumed that each class has the ability to perfectly rank the students from best to lowest. Unfortunately, there is variability in how exam problems are graded and how good questions are at distinguishing students, and so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything or not. Since the grading is based on a way to rank students, even if a lecturer is ineffective and thus the students in the course don't learn very much, the student's score will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but it seems like any top university should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). In particular, an issue here is that how much of the course material students learned is an extremely hard factor to quantify without imposing standards. However, standardized testing leads to ineffective teaching methods (and teaching "to the test") and is clearly not the answer. I'm not aware of a good way to solve this problem, but I think taking data-driven approaches to study this would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages for each of the classes, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, ideally the upper and lower curves would be much closer to the grade in the "true ranking" scenario. Thus, luck or scheduling conflicts have much less of an effect on a students grade. Then the question becomes how to accomplish this. This would mean that sometimes classes would get 40% As and maybe sometimes 15% As, but it would be okay because this is the grade the students should get.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean having three different test problems on each topic and randomly assigning each student one of these three problems. Apart from the logistic nightmare this would bring about, this would really only work when one lecturer is teaching all the classes. For example, if one of the lecturers is much better than another or likes to do problems close to test problems in lecture, then the students will perform better relative to students in other lectures because of their lecturer. To make this work, there needs to be a way to "factor out" the effect of the lecturer.

Another method would be to treat grading more like high school and set rigid grade distributions. The tests would then have to be written in a way such that we'd expect the outcome of the test to follow the guideline grade distributions set by the university, assuming the students in the class follow the general student population. Notably, the test is not written so that the particular course will follow the guideline grade distribution. Of course this is more work than simply writing a test, and certainly, the outcome of a test is hard to estimate. Often I've given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing the instructor did an extremely similar problem as a test problem in class.

One way to implement this would be to look at past tests and look at similar problems, and see how students did on those problems. (Coincidentally, this wasn't possible to do until recently when we started using Gradescope). This gives an idea how we would expect students to perform, and we can use this data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they'll have seen while practicing exams and so it is hard to define how similar a problem is. To do this right requires quite a bit of past data on tests, and as I mentioned earlier this isn't available. Similar problems given by other professors may help, but then we run into the same problem above in that different lecturers will standardize differently from how they decide to teach the course.

Without experimenting with different solutions, it's impossible to figure out what the best solution is, but it seems crazy to accept that curving classes is the best way. Through some work, there could be solutions that encourage student collaboration, reward students for doing their homework (I hope to write more on this in the future) instead of penalizing them for not doing their homework, and take into account how much students are actually learning.

Code for the figure is available here.

In this post, I'll be looking at trends in exam responses of physics students. I'll be looking at the Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested if things students get right correlate with the physics expectation. For example, I might expect students who were able to apply Gauss's law correctly to be able to apply Ampere's law correctly as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique that is often applied to topic modeling for bodies of text (like the last blog post). The idea of NMF is to take a matrix with positive entries, $A,$ and find matrices $W$ and $H,$ also with positive entries, such that

Usually, $W$ and $H$ will be chosen to be low-rank matrices and the equality above will be approximate. Then, a vector in $A$ is now expressed as the positive linear combination of the small number of rows (topics) of $W.$ This is natural for topic modeling as everything is positive, meaning that cancellations between the rows of $W$ cannot occur.

The data

For each student and for each rubric item, Gradescope stores whether the grader selected that item for the student. Each rubric item has points associated with it, so I use this as the weight of the matrix to perform the NMF on. The problem, though, is that some rubric items correspond to points being taken off from the student, which is not a positive quantity. In this case, I took a lack of being penalized to be the negative of the penalty, and those that were penalized had a 0 entry in that position of the matrix.

There was also 0 point rubric items (we use these mostly as comments that apply to many students). I ignore these entries. But finding a way to incorporate this information could also be interesting.

Once the matrix is constructed, I run NMF on it to get the topic matrix $W$ and the composition matrix $H.$ I look at the entries in $W$ with the highest values, and these are the key ideas on the topic.

Results

The choice of the number of topics (the rank of $W$ and $H$ above) was not obvious. Ideally, it would be a small number (like 5) so it would be easy to just read off the main topics. However, this seems to pair together some unrelated ideas by virtue of them being difficult (presumably because the better students did well on these points). Another idea was to look at the error $\| A - WH\|_2$ and to determine where it flattened out. As apparent below, this analysis suggested that adding more topics after 20 did not help to reduce the error in the factorization.

With 20 topics, it was a pain to look through all of them to determine what each topic represented. Further, some topics were almost identical. One such example was a problem relating to finding the work in an adiabatic process. Using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics. However, one topic had to be able to compute the work correctly, as the other one did not. This is probably an indication that the algebra leading up to finding the work was difficult for some. I tried to optimize these problems and ultimately chose 11 topics, which seems to work reasonably well.

Some "topics" are topics simply by virtue of being worth many points. This would be rubric items with entries such as "completely correct" or "completely incorrect." This tends to hide the finer details that in a problem (e.g. a question testing multiple topics, which is quite common in tests we gave). These topics often had a disproportionate number of points attributed to them. Apart from this, most topics seemed to have roughly the same number of points attributed to them.

Another unexpected feature was that I got a topic that negatively correlated with one's score. This was extremely counter-intuitive as in NMF each topic can only positively contribute to score, so having a significant component in a score necessarily means having a higher score. The reason this component exists is that it captures rubric items that almost everyone gets right. A higher scoring student will get the points in these rubric items from other topics that also contain this rubric item. Most of the other topics had high contributions from rubric items that fewer than 75% of students obtained.

Many topics were contained within a problem, but related concepts across problems did cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. However, it was more common for topics to be entirely contained in a problem.

The exam I analyzed was interesting as we gave the same exam to two groups of students, but had different graders grade the exams (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, but many topics seemed to cluster rubric items slightly differently. Still, the general topics seemed to be quite consistent between the two exams.

The plots show a student's aptitude in a topic as a function of their total exam score for three different topics. Clearly, depending on the topic the behaviors can look quite different.

Looking at topics by the student's overall score has some interesting trends as shown above. As I mentioned before, there are a small number (1 or 2) topics which students with lower scores will "master," but these are just the topics that nearly all of the students get points for. A little over half the topics are ones which students who do well excel at, but where a significant fraction of lower scoring students have trouble with. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may be indicative of misconceptions that students have in approaching these problems. My guess here would be that students did not evaluate an integral to determine the entropy change, but tried to determine it in some other way.

The rest of the topics (2-4) were topics where the distribution of points was relatively unrelated to the total score on the exam. In the example shown above, the topic was calculating (and determining the right signs) of work in isothermal processes, which is a somewhat involved topic. This seems to indicate that success in this topic is unrelated to understanding the overall material. It is hard to know exactly, but my guess is that these topics test student's ability to do algebra more than their understanding of the material.

I made an attempt to assign a name to each of the topics that were found by analyzing a midterm (ignoring the topic that negatively correlated with score). The result was the following: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every item in each topic fit nicely with these topics. In particular, the rubric items that applied to many students (>90%) would often not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The following spider charts show various student's abilities in each of the topics. I assumed each topic in a concept contributed equally to the concept.

Aptitude in the four main concepts for an excellent student (left) an average student (middle) and a below average student (right).

Conclusions

Since the data is structured to be positive and negative (points can be given or taken off), there may be other matrix decompositions that deal with the data better. In principle, this same analysis could be done using not the matrix of points, but the matrix of boolean (1/0) indicators of rubric items. This would also allow us to take into account the zero point rubric items that were ignored in the analysis. I do not know how this would change the observed results.

I had to manually look through the descriptions of rubric items that applied to each topic and determine what the topic being represented was. An exciting (though challenging) prospect would be to be able to automate this process. This is tricky, though, as associations that $S$ and entropy are the same could be tricky. There may also be insights from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.

Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year, and generally appreciate it I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks they use, but I will not get into that here). Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and was often taking the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time when the train arrived on time to Durham, and often the delays were over an hour.

This brings me to the point of this post, which is to answer the question, when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available so that past information is all I have to go off of. Certainly if all the trains are actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would make you miss about this many trains.

Fig. 1: Delay of arrival of the 79 train to the Durham station for each day since 7/4/2006 (with cancelled trains omitted). Note that in the first year and a half of this data, there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all the data available for the 79 train arriving to the Durham station. I've plotted the arrival times during this time in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" where I sample the distribution of arrival times to the Durham station. The particular train I take is just another experiment, and I would expect it to follow the available distribution of arrivals. Thus, the probability of me missing the train if I arrive $\tau$ minutes after the scheduled time is

Where $N(t>t')$ counts the number of arrivals in the available data where the arrival $t'$ is greater than the specified $\tau.$ The question, then, is how much of the data to include in $N(t>t').$ To test this, I considered a half year's worth of data as a test set. Then, I figured out how much of the previous data I should use as my training set to most accurately capture the delays in the test set. I found that using a year of data prior to the prediction time worked the best. The method is not perfect; the percentage of missed trains predicted using the training set is about 10% off from the number in the test set, as there are year-to-year variations in the data.

A plot of $p(\tau)$ using all the available data and only using the last year of data is shown in Fig. 2. Using only the last year to build the model, to have an 80% chance of making the train, one can show up about 20 minutes after the scheduled time. This also confirms Amtrak's estimate that their trains are on time 20-30% of the time. Even if one shows up an hour after the scheduled time, though, he or she still has a 36% chance of making the train!

Fig. 2: $p(\tau)$ determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, $\delta,$ given the data of past arrivals, $\mathcal{D}.$ I will call this $p(\delta|\mathcal{D}).$ Suppose I have a model, characterized by a set of $n$ unknown parameters $\vec{a}$ that describes the probability of delay. I will assume all the important information that can be extracted from $\mathcal{D}$ is contained in $\vec{a}.$ Then, I can decompose the delay distribution as

Using Bayes theorem, $p(\vec{a}|\mathcal{D})$ can then be expressed as

Here, $p(\mathcal{D}|\vec{a})$ is the likelihood function (the model evaluated at all of the data points), $\pi(\vec{a})$ is the prior on the model parameters, and $p(\mathcal{D})$ is the evidence that serves as a normalization factor.I use non-informative priors for $\pi(\vec{a}).$

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model $\mathcal{M}_1$ is $f$ times greater than the evidence for a model $\mathcal{M}_2$ where

As the models are assumed to depend on parameters $\vec{a}$ (note that a method that does not explicitly have a functional form, such as a machine learning method, could still be used if $p(\mathcal{D}|\mathcal{M})$ could be estimated another way)

Here, $\delta_i$ are all of the delays contained in $\mathcal{D}.$ This integral becomes difficult for large $n$ (even $n=3$ is getting annoying). To make it more tractable, let $l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})),$ and let $\vec{a}^*$ be the value of the fit parameters that maximize $l(\vec{a}).$ Expanding as a Taylor series gives

where $H$ is the matrix of second derivatives of $l(\vec{a})$ evaluated at $\vec{a}^*.$ The integral can be evaluated using the Laplace approximation, giving

which can now be evaluated by finding a $\vec{a}^*$ that maximizes $p(\mathcal{D}|\vec{a}).$ (Regular priors much be chosen for $\pi(\vec{a}^*|\mathcal{M})$ since I have to evaluate the prior. I will ignore this point here). I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found under this procedure that the Gamma/Gompertz function described the data the best under this metric. Using this, I explicitly calculate $p(\delta|\mathcal{D}),$ again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.

While this section got a bit technical, it confirms the results of the earlier simple analysis. In particular, this predicts that one should show up about 17 minutes after the scheduled arrival time to ensure that he or she will catch 80% of trains, and that one still has 30% chance of catching the train if one shows up an hour late to the Durham station.

Fig. 3: $p(\delta|\mathcal{D})$ calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.

Conclusions

Since I downloaded all the data, 5 days have passed and in that time, the delay of the 79 train has been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it seems like the prediction that 80% of the time one would be fine showing up 20 minutes late to the Durham station isn't such a bad guideline.

Of course, both of these models described are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainties is that all days are not equivalent. Certain days will be more likely to have a delay than other. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the model, and for a true model of delays these should either be included or the effect of such systematic fluctuations could be characterized.

Scripts for this post are here.

Predicting Fires

While organizing a data science workshop this summer, I realized that I hadn't ever written a blog post about the data science project I worked on last year. So, this post will be a summary of what I learned from working on the project.

The problem I worked on was Kaggle's Fire Peril Loss Cost. Given over 300 features, including weather and crime, for over one million insurance policies, we wanted to predict how much the insurance company would lose on a policy due to a fire. This is tricky as fires are rare events, and thus almost all policies have no loss.

Features
We first implemented one-hot encoding to turn the categorical variables into boolean arrays indicating whether a certain category applied to a policy or not. This is better than assigning a number in succession to each category (e.g. Category 1=0, Category 2=1, Category 3=2,...) as doing this will assign a hierarchy/metric on the data, which could create spurious relations. One this encoding was done, all of our data was in a numerical form that was amenable for use in machine learning algorithms.

With so many features on so many policies, the entire dataset would not fit in memory on my laptop. Also, we found that not all 300 features were good predictors of the target value. Thus, we spent time selecting the most important features to make predictions.

There were a manageable amount of categorical variables, so we kept all of these, though we did try removing some of them to see if there was any performance benefit from doing so. For each continuous variable, we tried a model where the prediction was simply the value of the variable. Since the evaluation metric (a weighted gini index) only depended on the ordering of the predictions and not the magnitudes, this analysis method was amenable for all of the continuous variables. Notably, we found that one of the variables (var13) was already a reasonably good predictor of the target. We kept the 30 continuous features that scored best under this metric. We chose this selection method over other more common feature selection methods (such as PCA) to avoid some of the stability issues associated with them, but it may have been interesting, given more time, to see how various feature selection methods fared with one another.

Machine Learning
Since only ~0.3% of policies had any loss, we considered using a classifier to first identify the policies with loss. Ideally, the rate of policies with loss identified by this classifier as not having loss would be sufficiently low. Then a regressor could be used on the policies identified by the classifier as having loss, and then the training set would be less singular. We tried a few classifiers, but did not have much success with this approach.

We then tried to use regressors directly. We tried many of the machine learning regressors available in scikit-learn. We found good results from ridge (Tikhonov) regression and gradient boosted decision trees. In the end, we ended up combining the predictions of the two methods, which will be discussed a bit later.

Ridge regression is similar to standard linear regression, but instead of just minimizing the 2-norm of the vector of residuals, there is a penalty term proportional to the two-norm of the vector of coefficients multiplying the features. This penalty for feature coefficients ensures that no feature coefficients become too large. Large feature coefficients could be a sign of a singular prediction matrix and thus could fluctuate wildly. We found through testing that the optimal constant in front of the 2-norm of the vector of feature coefficients was quite small, so the ridge regression was acting quite similarly to linear regression. Note that with ridge regression it is important that all the features are normalized, as not doing this affects the size of the 2-norm of the vector of feature coefficients.

A decision tree is a set of rules (decisions) used to group policies into different classes. These rules are simple ones such as "is var13>0.5?" These rules are chosen at each step so that they best split the set of items (the subsets should mostly all be of the same values). There are different notions for what best means here, but using the gini impurity or information gain (entropy) are common choices. With enough rules, given a grouping, each policy in the group will all have similar target values. A new policy with an unknown target can then be compared against these rules and the target can be predicted to take on the value of the target in the group of policies it ends up with. Note that the depth of the tree (the number of rules) needs to be limited so that the method does not overfit. One could come up with enough rules such that each policy is its own group, but then the method loses predictive power for a new policy with an unknown target value.

On its own, the decision tree is not great at regression. The power of the method comes from the "gradient boosting" part of the name. After a decision tree is created, there will invariably be some policies that are misclassified. In analogy to gradient descent, the decision tree is then then trained to fit the residuals as the new target variable (or more generally the negative gradient of the loss function). This corrects for errors made at each iteration, and after many iterations, makes for quite a robust regressor and classifier.

We got good performance from each of these methods, and the two methods arrive at the prediction in very different ways. Thus, we combine the results from these two machine learning methods to arrive at our final prediction. We considered a standard mean as well as a geometric mean for the final prediction. We found that the geometric combination was more useful. This seems reasonable in predicting rare events as then both methods have to agree that the prediction value is large to net a large prediction, whereas only one method has to have a small prediction value to net a small prediction.

Other Things
We probably could have dealt with missing entries better. It turned out that many of the features were strongly correlated with other features (some even perfectly) so we could have used this information to try to fill in the missing features. Instead, we filled all missing entries with a value of 0. In general, it's probably best to treat missing features as a systematic error and the effect could be quantified through cross-validating by considering various scenarios of filling in missing entries.

It also turned out that some of the features that were labeled as continuous were not actually continuous and were discrete (there were only a few values that the continuous variable took). There may have been some performance benefit from implementing one-hot encoding on these as well.

For the ridge regression, we could have applied standard model selection methods such as AIC and BIC to choose the key features. For the gradient-boosted decision trees, using these methods is a bit trickier as the complexity of the fit is not easy to determine.

One lesson I learned was the importance of cross-validation. $k$-fold cross-validation randomly splits up the training set into $k$ subsets. Then, each subset is used as a test set with the complement used as the training set. This gives an idea of how well the model is expected to perform and also how much the model may be expected to overfit. The cross-validation estimate of the error will be an overestimate of the true prediction error, since only a subset of the data is used for prediction, whereas the whole dataset would be used for a true prediction [1]. Ideally, one is in a regime where this difference is not crucial. By adjusting the value of $k,$ a good regime where this is true can be found.

While we did split up our data set into the halves and test by training on one half and predicting on the other, we could have been more careful about the process. In particular, the feature selection process could have been carefully verified. In addition, we trusted our position on the Kaggle leaderboards more than our cross-validation scores, which led our final predictions to overfit to the leaderboard more than we would have liked.

References:
1. Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning.

Hypothesis Testing in a Jury

I recently served on a jury and was quite surprised as how unobjective some of the other jurors were being when thinking about the case. For our case, it turned out not to matter because the decision was obvious, but it got me thinking about a formal reasoning behind "beyond a reasonable doubt." This reasoning will involve more statistics than physics, but considering I've been thinking about Bayesian analyses recently in my research, it's quite appropriate.

At the most basic level, a jury decision is a hypothesis test. I wish to distinguish between the hypotheses of not guilty (call it $\mathcal{H}_0,$ since the defendant is innocent until proven guilty) and guilty (call it $\mathcal{H}_1$). In Bayesian statistics, the way to compare two hypotheses is by computing the ratio of posterior probabilities.

Where $p(\mathcal{H}|D)$ is the probability of the assumption $\mathcal{H}$ given the available data (the evidence). This probably does not seem obvious to compute, and I'll discuss later how one might determine values for these. If $F=2,$ then the hypothesis $\mathcal{H}_1$ is twice as likely as the hypothesis $\mathcal{H}_0$. Thus, if $F \gg 1,$ then the evidence for $\mathcal{H}_1$ is overwhelming. What this means in terms of "beyond a reasonable doubt" is debatable, but it is generally accepted that if $F \gtrsim 100,$ there is strong evidence for $\mathcal{H}_1$ over $\mathcal{H}_0$ [1]. Similarly, if $F\ll 1,$ then the evidence for $\mathcal{H}_0$ is overwhelming. Thus, the question or determining guilt or not guilt is equivalent to calculating $F.$

$p(\mathcal{H}|D)$ can be rewritten using Bayes theorem as

Here, $p(D|\mathcal{H})$ is the probability of the evidence given the assumption of guiltiness (or not-guiltiness), which is more tractable than $p(\mathcal{H}|D)$ itself. Note the prosecutor's fallacy can be thought of as confusing $p(D|\mathcal{H})$ with $p(\mathcal{H}|D).$ $p(\mathcal{H})$ is the prior, which takes into account how much one believes $\mathcal{H}$ with regard to other hypotheses. $p(D)$ is a normalization factor to ensure probabilities are always less than 1. In the relation for $F,$ this cancels out, so there is no need to worry about this term. With this replacement, the ratio of posterior probabilities becomes

The first ratio is called the Bayes factor. The second ratio quantifies the ratio of prior beliefs of the hypotheses. The ratio is, given no other information, the odds that the defendant is guilty. Suppose I accept that there was a crime committed, but that the identity of the criminal is in question. If there is only one person that committed the crime, this would then be the inverse of the number of people who could have committed the crime.

Now, I will consider the calculation of the Bayes factor for a real trial. Consider R v. Adams, which set a precedent for banning explicit Bayesian reasoning in British Courts in the context of DNA evidence. It was estimated during the trial that there were roughly 200,000 men in the age range 20-60 who could have committed a crime. Note that some extra assumptions on age and gender seem to be made here, so this does not seem applicable to the ratio of prior beliefs. However, if I lifted these restrictions, the Bayes factor for the victim's misidentification of the defendant would change accordingly, so this is not a concern.

First, consider the DNA evidence that was the only piece of evidence incriminating the defendant. Call this evidence $D_0.$ $p(D_0|\mathcal{H}_1)$ is the probability of a positive DNA match under the assumption that the defendant is guilty. This is presumably extremely close to 1, or DNA evidence would not be considered good evidence in trials. $p(D_0|\mathcal{H}_0)$ is the probability of a positive match if the defendant is not guilty. Taking into account the population of the U.K., this was estimated in the trial to be between 1 in 2 million and 1 in 200 million (though possibly as low as 1 in 200 since the defendant had a half-brother) [2]. Thus, the Bayes factor considering only the DNA evidence, is between 2 million and 200 million. With the 1 in 200,000 prior probability, the posterior probability ratio $F$ is between 10 and 1000. Only the higher end of this range is overwhelming evidence, and in the case of conflicting evidence, the jury is supposed to give the benefit of the doubt to the defendant, so it seems a "not guilty" verdict would have been appropriate.

Further, this ignores all of the other evidence that helped to prove the defendant's innocence. This included the victim failing to identify the defendant as the attacker and the defendant having an alibi for the night in question. Let us call these two pieces of evidence $D_1$ and $D_2.$ Unlike the DNA evidence, the witnesses do not explicitly mention what the relevant probabilities are for this evidence, so it is up to the jurors to make reasonable estimates for these quantities. $p(D_1|\mathcal{H}_1)$ is the probability the victim fails to identify the defendant as the attacker given the defendant's guilt. Set this to be around 10%, though police departments may actually have statistics on this rate. On the other hand, $p(D_1|\mathcal{H}_0),$ the probability the victim fails to identify the defendant as the attacker given the defendant is not guilty is high, say around 90%. Thus, the Bayes factor considering the victim's failed identification of the defendant is about 1 in 10. Note that even if these numbers change by 10% this factor doesn't change in order of magnitude, so as long as a reasonable estimate is made for this factor, it doesn't really matter what the actual value is. The alibi is less convincing. Though the defendant's girlfriend testified, the defendant and the girlfriend could have confirmed their story with each other. Thus, I estimate the Bayes factor for the alibi $p(D_2|\mathcal{H}_1)/p(D_2|\mathcal{H}_0)$ to be about 1 in 2.

Since all these pieces of evidence are independent, $p(D_0,D_1,D_2|\mathcal{H})=p(D_0|\mathcal{H})p(D_1|\mathcal{H})p(D_2|\mathcal{H}).$ Thus, the Bayes factor for all evidence is between 100,000 and 10,000,000. Now, multiplying by the prior probability, this gives a posterior probability ratio, $F,$ between 0.5 and 50. With the new evidence taken into account, there is no longer strong evidence that the defendant is guilty even in the best case scenario for the prosecution. Convicting someone with a posterior probability ratio of 50 would falsely convict people 2% of the time, which seems like an unacceptable rate if one is taking the notion of innocent until proven guilty seriously. Note that as long as the order of magnitude of each of the Bayes factor estimates doesn't change, the final result will also not change by more than 1 order of magnitude, so the outcome is fairly robust.

While this line of logic was presented to the jurors during the trial, the jurors still found the defendant guilty. The judge took objection to coming out with a definite number for the odds of guilt when the assumptions going into it are uncertain, though as argued before, for any reasonable choice, these numbers cannot change too much [2]. It seems that without formal training in statistics, it was difficult to accept these rules as "objective," even though this is a provably, well-defined, mathematical way to arrive at a decision. If common sense and the rules of logic and probability are really what jurors are considering to reach their decision, this has to be the outcome that they reach. [3] argues that to not believe the outcome of a Bayesian argument like this would be akin to not believing the result of a long division calculation done on a calculator.

The most common objection to Bayesian reasoning (though apparently not the one the judges in R v. Adams had) is that the choice of prior can be somewhat arbitrary. In the example above, in the estimation of people that could have committed the crime, I could take people who were on the same block, the same neighborhood, the same city, or maybe even the same state. Each of these would certainly give different answers, so care must be taken to choose appropriate values for the prior. This doesn't make the method wrong or unobjective. It is just that the method cannot start with no absolutely no assumptions. Given a basic assumption, though, it provides a systematic way to see how the basic assumption changes when all the evidence is considered.

This problem seems to stem from a misunderstanding of basic statistics by the jury, attorneys, and judges. Another example of this is the U.K. Judge Edwards-Stuart who claimed that putting a probability on an event that has already happened is "pseudo-mathematics" [4]. This just shows the judge's ignorance, as this is precisely the type of problem Bayesian inference can explain. It is a shame that in the U.K. Bayesian reasoning is actively discouraged due to R v. Adams, as this is the only rigorous way to deal with these types of problems. I wasn't able to find any specific examples in the U.S., but I assume the "fear" of Bayesian statistics in courts here is similar to the case in the U.K.

References
1. Jeffreys, H. 1998. The Theory of Probability.
2. Lynch, M. and McNally, M., 2003. “Science,” “common sense,” and DNA evidence: a legal controversy about the public understanding of science. Public Understanding of Science, 12-1, 83-103.
3. Fenton, N. and Neil, M., 2011. Avoiding Probabilistic Reasoning Fallacies in Legal Practice using Bayesian Networks. Austl. J. Leg. Phil., 36, 114-151.
4. Nulty and Others v. Milton Keynes Borough Council, 2013. EWCA Civ 15.

Ferguson and Game Theory

Note: In this post I use words like "player," "game," and "payout" to refer to the interactions between criminals and police officers. I do not mean to trivialize deadly encounters between criminals and police officers and I am not implying that police officers are playing a game at their jobs. These are the terms that are used in discussions of game theory and to stay consistent with this, I use these terms.

In this blog, I usually talk about how ideas of physics apply to studying problems "outside physics". As far as I know, game theory isn't too useful in physics, but physicists have contributed to the theory of it as in [1]. Game theory is an interesting way to mathematically model interactions of people.

In a game theory game such as the prisoner's dilemma, each player has a choice of cooperating or defecting with the opponent. If the player cooperates while the other cooperates, the two players both get a mid sized reward. If one player cooperates while the other defects, that player gets nothing while the opponent wins a large reward. In the player defecting, opponent defecting case, the converse occurs. If both players defect, they both get a small reward. While seemingly simple, there is a rich theory to games like these. In particular, the iterated prisoner's dilemma (where this game is repeated many times) is used to study population dynamics.

In a game, the rewards (or payouts as they will be referred to later) are characterized by a payout matrix. The payout matrix shows, for each player, what his or her payout will be given each of the possible outcomes.

Fig. 1: Examples of payout matrices for each player in the prisoner's dilemma. If both players cooperate, they both get a payout of 3. If both players defect, they both get a payout of 1. If one player defects while the other cooperates, the defector gets 5 while the cooperator gets 0. Note that for the prisoner's dilemma, the game looks the same for each player. In the game I consider below, this is not the case.

We will apply this type of game theory logic to whether a police officer should shoot a criminal or not. This game is more complicated than the classical prisoner's dilemma as the payoff is not the same to either player (police officer or criminal), even if we assume each player values their life the same. Further, the criminal's payoff matrix is certain, while the police officer's payoff matrix has some uncertainty.

The scenario I am imagining is the following, inspired by the Ferguson incident. A criminal has committed a crime and is being chased down by a (or a pair of) police officer(s). I will take cooperating to mean that one player does not attack the other player, and defecting to mean that one player does attack the other player. I assume that if a player has a gun and defects, the other player dies, though most of the following argument would also follow if instead of dying the player is severely injured. For payouts, I assume a hierarchy where the payout for nothing happening >> the payout for arrest, suspension >> the payout for death (severe injury). I will take specific values of 0,-100,-10000 for these payouts mostly to make the arguments easier to understand, but the general results should hold regardless of the specific values chosen, as long as this hierarchy is in place. I assume that the game is played only once between the criminal and police officer(s), but it would be possible to think of this as an iterated game, where both players have to assess at each stage how much danger they are in and determine to defect or cooperate.

The payout matrix for the law enforcement officer is uncertain. The payout looks like the following, where the top consequences are the payouts when the criminal has a gun (or a lethal weapon) and the bottom are consequences when the criminal does not.

Fig 2: Payout matrix for the law enforcement officer. The top consequences are the consequences when the criminal has a lethal weapon, whereas the bottom consequences are when the criminal does not.

I would like to assign numerical values to these scenarios. In order to average the two scenarios (criminal having a lethal weapon and not) I will assume a prior distribution. This would be the fraction of criminals that are arrested by police officers that are carrying lethal weapons. This prior distribution presumably depends on the type of the crime the criminal committed, but I will not take this factor into account in the simple model I am considering. I was not able to find information about this distribution easily (though I hope this information is available to law enforcement officers) so I assume that 50% of criminals are carrying lethal weapons. However, I expect that this is an overestimate.

Fig 3: Numerical values for the payout matrix for the law enforcement officer. The values are obtained by averaging the payouts of the each of the consequences weighted by the probability that the criminal has a lethal weapon.

Thus, if the law enforcement officer knows that the criminal is cooperating, the optimal thing for him or her to do is to cooperate as well, as this is what maximizes the payout. Unfortunately, there is a chance the criminal will defect. In this case, the police officer should still cooperate, as there is a chance the criminal does not have a lethal weapon. Then, the officer has no consequences, which is better than facing suspension. Note, however, that if the criminal is defecting, the difference in payout for the two strategies of the police officer is quite small (0.1% with the numbers I have chosen). The two strategies are almost equivalent to the police officer.

Now consider the criminal. The payout matrix for the criminal is certain, as it is common knowledge that law enforcement officers carry weapons. It would look like the matrices below.

Fig 4: Descriptive and numerical payout matrices for the criminal. We take the criminal defecting, police officer cooperating case to be -1000 as it is part way between -100 and -100000. This may be an underestimate as if a criminal manages to shoot a police officer, there is a chance that another police officer will shoot the criminal.

The interesting thing here is that if the police officer is defecting (which, recall is not such a bad strategy for the police officer if the police officer assumes the criminal will defect), there is no optimal strategy for the criminal. The criminal dies in both cases. Clearly then, as there is no strategy if the police officer defects, the criminal can only optimize the case when the police officer cooperates, meaning the criminal should cooperate. Thus, the Nash equilibrium is for the criminal and police officer to cooperate. This is presumably what happens most of the time. Criminals are arrested every day but we (comparatively) rarely hear about incidents where the criminal or police officer is killed. So does this mean game theory cannot explain what happened at Ferguson? Not quite.

Let's say the strategy of the police officer is to defect a fraction $x$ of the time and to cooperate $1-x$ of the time. Then the expected payout (probability multiplied by the payout of that case) for the criminal is $-100(1-x)-100000x$ if the criminal cooperates and $-1000(1-x)-100000x$ if the criminal defects. As an example, take $x$ to be $0.01$. At this value, the expected payout for the criminal is $-1099$ if he or she cooperates and $-1990$ if he or she defects. Clearly, the defecting strategy is still worse, but the difference between cooperating and defecting is only about 50%. The difference would be even less if the fraction $x$ were higher or if the payout of death were even more negative. Thus, if there's any reason for the criminal to think that the police officer may defect (and given news reports of incidents like Ferguson, there is probably reason for the criminal to expect this), the difference between cooperating and defecting is not so different in terms of payout. This means the actions of the criminal become unpredictable, which means the police officer may need to defect to preserve his or her life.

The main cause of this issue is that dying is really bad. Thus, if there's even a small chance that the police officer will defect, the criminal's choices become mostly equivalent. Now suppose police officers are only able to use non-lethal means against the criminal. Then the payout matrix for the criminal looks quite different.

Fig 5: Descriptive and numerical payout matrices for the criminal when the police officer cannot use lethal means against him or her. I choose a value of -200 for getting hurt, though it may not necessarily be this bad. If the criminal can prove that the police officer attacked him or her without just cause, the criminal may even receive some sympathy.

Now, if the criminal defects, the payout is -1000 no matter what, while if the criminal cooperates the expected payout can be anywhere between -100 and -200, depending on the police officer's strategy. The expected payout is no longer (approximately) degenerate as above. The clear choice here is to cooperate, as defecting is many times worse. If the criminals and police officers are acting rationally, then an assurance by law enforcement to not kill criminals seems to improve the rate that the criminals should cooperate. Of course, the problem with this is, that while the criminal's payoff matrix has changed by ensuring the police officer will not kill the criminal, the police officer's payoff matrix remains the same.

As mentioned earlier, unless the probability the criminal has a lethal weapon is extremely small, if the criminal has a chance to defect, the expected payout for the police officer is also nearly degenerate no matter the police officer's strategy. The police officer's situation could be improved though, if the chance of death of the police officer were smaller. This could include protective equipment and training to not get fatally wounded in these encounters. In addition, there is a crucial difference between the criminal and police officer in that the police officer has more experience with these encounters than the criminal does (though the police officer is disadvantaged in that he or she has less information about the opponent). Thus, it is imperative that the police officer is trained to make the right decisions in these tough situations. As stated before, this is presumably what happens most of the time when criminals get arrested, and thus the police officers are usually doing a good job. If the payoff to the police officer were truly degenerate, we would expect the police officers to defect about half of the time. However, there are incidents where the police officer unnecessarily perceives danger. The police officers should be aware aware of their risks so that tragedies do not occur.

The payoff matrix for the police officer also looks quite different based on the prior distribution (or expectation) of the criminal's likelihood to be carrying a lethal weapon. I assume I have overestimated the likelihood a criminal carries a lethal weapon, but with this information the police officer's "strategy" could be improved, and as I stated above, this could prevent people from unnecessarily dying.

Most of the analysis above has depended on the players of the game acting at least approximately rationally. It is an interesting question, though, how rationally we can expect the players in this game are playing. The criminal presumably is "on edge" from just having committed a crime and being chased down by police officers would probably not help the situation. In a study done with the game show Golden Balls, contestants were found to be more altruistic, in general, than a simple game theory analysis of rational players would predict [2]. This shows that even in scenarios where life or death is not on the line, people may not be expected to act rationally. With information about how criminals are likely to act, police officers can develop better strategies on how to deal with situations such as this one.

References
1. Press, W.H., 2012. Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl. Acad. Sci. 109-26, 10409–10413.
2. Van den Assem, M.J., 2012. Split or Steal? Cooperative Behavior When the Stakes Are Large. Management Science. 58-1, 2-20.

Entropy and least squares

While most of the papers I reference in this blog are relatively new, this article will discuss an old idea, described in some detail in [1]. Entropy has been used as a concept in information theory for over half a century. One definition of the thermodynamic entropy is

Here $k_B$ is the Boltzmann constant and $p_i$ is the probability that the system is in the $i$th microstate. The sum is over all of the microstates, which is one particular configuration for the system. For example, if I were calculating the entropy of a gas molecule in a box, a microstate would be the gas molecule with a particular position and momentum.

Since this is an equation with probability in it, it seems natural to extend the idea to other fields that deal with probability. This is the idea behind the Shannon entropy of information theory and statistics. This entropy is defined in a similar way. If one has a discrete probability distribution, $p(x_i),$ then the Shannon entropy is

Here the sum is over all possible values of $x_i.$

The second law of thermodynamics states than an isolated system will always try to maximize its entropy. Thus, if I want to determine what the equilibrium configuration of a system is, I can do this by maximizing the entropy of the system. This is the approach [2] takes in "deriving" thermodynamics from basic statistical mechanics. The entropy is maximized subject to the constraint that the distribution should have a certain energy or particle number, and the Lagrange multipliers enforcing the constraints turn out to be related to temperature and chemical potential.

A similar idea can be applied to statistics. Suppose we would like for a probability distribution to have average $\mu$ and standard deviation $\sigma,$ but have as little other information encoded as possible. In other words, given a set of data points $x_i,$ I would like to find $p(x_i),$ the assignment of probability to each of the $x_i,$ such that the average $x_i$ value will be $\mu$ and the standard deviation is $\sigma^2$. I would do this by maximizing the Shannon entropy (by setting $dS = 0$) subject to some constraints. These constrains are

By setting $dS=0,$ the form of the equation for $p(x_i)$ becomes

Here, the $\lambda_j$ are the Lagrange multipliers. I then plug this form into the constraint equations to solve for the values of $\lambda_j.$ This gives

where $Z,$ the partition function, is

So I see the partition function is a useful object in the statistical sense as well. We can't simplify this any further without knowing specific values for $x_i,$ but given this information it would be easy to solve for the values of the Lagrange multipliers.

This procedure produces some reasonable results. For example, if the set of points is the entire real axis and I would like to apply the constraints above (though I have to do things a little differently since this is a continuous case), the distribution this procedure gives turns out to be a Gaussian. Thus, the Gaussian is a distribution over the whole real line that has a set average and standard deviation but encodes as little other information as possible.

There is a notion of "relative entropy" that may not be as familiar to physicists (at least I had never heard of it). This is called the Kullback-Leibler (KL) divergence. This can be quantified as (this is actually the negative of the KL divergence)

The KL divergence of a distribution $P$ from a distribution $Q$ quantifies how much information is lost when $Q$ is used to approximate $P.$ This seems like a nice thing to consider in the context of regression. I will follow [3] to use this to show how to compare two fit models and determine which one is more robust.

Let me assume there is some true distribution $f(y)$ and I am approximating it by a function $g(y|x).$ Now consider the expected entropy in $x$ (here $E_x$ will denote the expectation value with respect to $x$). This is

Now suppose there were another model, $g'(y|x).$ I would like to consider whether $g$ or $g'$ describes $f$ better. I can look at this by looking at the difference in the expected entropy of the two.

I have made a measurement of $\ln(g(y|x))-\ln(g'(y|x))$ by performing the fit, as this is a ratio of log-likelihoods. Asymptotically, (look at [3] for details) this measurement will differ from the expected value by $2(k-k'),$ where $k$ and $k'$ are the number of parameters used in the fit of $g$ and $g',$ respectively. Correcting for this bias, the difference in entropies is

where $L$ and $L'$ are the likelihood functions of the model $g$ and $g'.$ Thus, while the logic was a bit complicated, the difference in entropies shows us an amazingly simple result! All that I need to do to compare the quality of two models is to look at the difference of twice the log-likelihood and the number of fit parameters. This is the Akaike information criterion (AIC). The AIC is an extremely useful metric to decide whether a model is being over- or under-fitted!

References:
1. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics. Physical Review 106-4, 620-630.
2. Pathria, R.K., 1972. Statistical Mechanics.
3. Akaike, H, 1985. Prediction and Entropy. A Celebration of Statistics 1, 1-24.