Predicting Elections from Pictures

This work was done by the amazing team I mentored during the CDIPS data science workshop.

I got the idea for this project after reading Subliminal by Leoonard Mlodinow. That book cited research suggesting that when people are asked to rate pictures of people based on competency, the average competency score of a candidate is predictive of whether the candidate will win or not. The predictions are correct about 70% of the time for senators and 60% for house members, so while not a strong indicator, there seems to be some correlation between appearance and winning senate races. So, we decided to use machine learning to build a model to assess senator faces in the same manner.

This is a challenge as there's only around 30 senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old since as trends in hair and fashion change, these probably affect how people perceive competence of people. We ended up using elections from 2007-2014 as training data to predict on the 2016 election. We got the senator images from Wikipedia and Google image search. We got images for the top two finishers, which is usually a democrat and republican. We didn't include other elections since the images were less readily available and we aren't sure if looking senatorial is the same as looking presidential (more on that later).

Interpreting Faces
We use a neural network to learn the relationships between pictures and likelihood of winning elections. As input, we provide a senator image with the label of whether the candidate won their race or not. Note that this means that our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the election (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To assess the winner of a particular election, we compare the probability of the two candidates and assume the candidate with the higher probability will win.

In order to cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers followed by max pooling layers. After the convolutional layers, we used a fully connected layer before outputting the election win probabilities. Even with these simple networks, there are millions of parameters that must be constrained, which will result in overfitting with the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images in order to increase the number of training images to make the training more robust. We also explored transfer learning, where we train the model using a similar problem with more data, and use that as a base network to train the senator model on.

We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).

Model Results
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on senator images (with the image modifications).
(II) Transfer on senator images from male/female classifier trained on faces in the wild.
(III) Transfer on face images from vgg face (this is a much deeper network than the first two).
We compare and contrast each of these approaches to the problem.

The accuracy in predicting the winners in each state in 2016 for each model were respectively (I) 82%, (II) 74%, and (III) 85%. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of senate races in 2016 is small, so these results come with a lot of uncertainty. In addition, the training and validation sets are not independent. Many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.

Screen Shot 2017-08-19 at 4.42.13 PM

The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.

Validating Results
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.

Screen Shot 2017-08-19 at 1.22.32 PMScreen Shot 2017-08-19 at 4.53.34 PM

Scores predicted by each of the models for different pictures of the same candidate. Each row is the prediction of each of the three models.

All of our models have some variability in predictions on pictures of the same candidate, so our model may benefit from learning on more varied pictures of candidates. We have to be careful, though, as lesser known candidates will have fewer pictures and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest score among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos and the model may be catching this trait.

We also looked at what features the model learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the trustworthiness of a candidate.

Screen Shot 2017-08-19 at 1.24.10 PM

Scores predicted by each of the models for pictures where part of the image is masked. Each row is the prediction of each of the three models.

We find masking the images wildly changes the prediction for models (I) and (II) but not for model (III). It seems that for this model apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.

We also compared what the first convolutional layer in each of the models learned.

Screen Shot 2017-08-19 at 9.22.41 PM

Sample of outputs of the first convolutional layer after passing in the image on the far left as visualized by the quiver package. Each row shows the output for each of the three models. We see that each model picks up on similar features in the image.

This confirms that apparel is quite important, with the candidate's tie and suit being picked up by each of the models. The models do also pick up on some of the edges in facial features as well. A close inspection of the layer output shows that the output of model (III) is cleaner than the other two models in picking up these features.

Given all of these findings, we determine the most robust model is model (III), which was the model that predicted the most 2016 elections correctly as well.

Earlier, we mentioned we trained on senator data because we were not sure whether other elections had similar relationships between face and winning. We tested this hypothesis on the last three presidential elections. This is clearly an extremely limited dataset, but we find the model predicts only one of the elections correctly. Since presidential elections are so rare, training a model to predict on presidents is a challenge.

Screen Shot 2017-08-19 at 10.41.20 PM

Model (III) predictions on presidential candidates.

Our models were trained to give a general probability of winning an election. We ignore the fact that senator elections for the most part are head to head. There may be benefits from training models to consider the two candidates running for the election and having the model choose the winner. Ultimately, we would want to combine the feature created here with other election metrics including polls. This would be another large undertaking to figure out how to reliably combine results, but this may offer orthogonal insights to methods that are currently used to predict election results.

Check out the code for the project here.

Who will win Top Chef Season 14?

Warning: Spoilers ahead if you have not seen the first two episodes of the new season

In the first episode of the season Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has one the first quickfire and gone on to win the whole thing (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9 when a whole team won the elimination challenge). This got me wondering if there were other predictors as to who would win Top Chef.

There's not too much data after the first elimination challenge, but I tried building a predictive model using the chef's gender, age, quickfire and elimination performance, and current residence (though I ultimately selected the most predictive features from the list). I used this data as features with a target variable of elimination number to build a gradient-boosted decision tree model to predict when the chefs this season would be eliminated. I validated the model on seasons 12 and 13 and then applied the model to season 14. I looked at the total distance between the predicted and actual placings of the contestants as the metric to optimize during validation. The model predicted both of these seasons correctly, but seasons 12 and 13 were two seasons where the winner of the first elimination challenge became top chef.

The most important features in predicting the winner were: elimination challenge 1 performance, season (catching general trends across seasons), gender, home state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Chicago. To try to visualize this better, I used these important features and performed a PCA to plot the data in two dimensions. This shows how data clusters, without any knowledge of the ultimate placement of the contestants.


A plot of the PCA components using the key identified features. The colors represent ultimate position of the contestants. Blue represents more successful contestants where red represents less successful contestants. The x direction corresponds mostly to first elimination success (with more successful contestants on the right) and the y direction corresponds mostly to gender (with male on top). The smaller spreads correspond to the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is a certain deal-breaker in terms of winning the competition, but of course winning the first challenge puts you in a better position.

My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.

Of course there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about including but didn't was what the chefs actually cooked. I thought certain ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g. season 11 was probably the only time turtle was cooked in an elimination challenge). Obviously, with more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.

Code is available here.

Grade Inflation

I have been thinking a lot about teaching lately (maybe now that I will not be teaching anymore) and I hope to write a series of a few blog posts about it. My first post here will be on grade inflation, specifically whether curving is an effective way to combat it.

A popular method to combat grade inflation seems to be to impose a set curve for all classes. That is, for example, the top 25% of students get As, the next 35% get Bs and the bottom 40% gets Cs, Ds, and Fs (which is the guideline for my class). While this necessarily avoids the problem of too many people getting As, it can be a bit to rigid, which I will show below.

In the class I teach, there are ~350 students, who are spread among three lectures. I will investigate what effect the splitting of students into the lectures has on their grade. First, I will make an incredibly simple model where I assume there is a "true" ranking of all the students. That is, if all the students were actually in one big class, this would be the ordering of their grades in the course. I will assume that the assessments given to the students in the classes they end up in are completely fair. That is, if their "true" ranking is the highest of anyone in the class, they will get the highest grade in the class and if their "true" ranking is the second highest of anyone in the class they will get the second highest grade and so on. I then assign students randomly to three classes and see how much their percentile in the class fluctuates based on these random choices. This is shown below


The straight black line shows the percentile a student would have gotten had the students been in one large lecture. The black curve above and below it show the 90% variability in percentile due to random assignment.

We see that even random assignment can cause significant fluctuations, and creates variability particularly for the students in the "middle of the pack." Most students apart from those at the top and bottom could have their letter grade change by a third of a letter grade just due to how the classes were chosen.

Further, this was just assuming the assignment was random. Often, the 8 am lecture has more freshman because they register last and lectures at better times are likely to fill up. There may also be a class that advanced students would sign up for that conflicts with one of the lecture times of my course. This would cause these advanced students to prefer taking the lectures at other times. These effects would only make the story worse from what's shown above.

We have also assumed that each class has the ability to perfectly rank the students from best to lowest. Unfortunately, there is variability in how exam problems are graded and how good questions are at distinguishing students, and so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything or not. Since the grading is based on a way to rank students, even if a lecturer is ineffective and thus the students in the course don't learn very much, the student's score will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but it seems like any top university should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). In particular, an issue here is that how much of the course material students learned is an extremely hard factor to quantify without imposing standards. However, standardized testing leads to ineffective teaching methods (and teaching "to the test") and is clearly not the answer. I'm not aware of a good way to solve this problem, but I think taking data-driven approaches to study this would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages for each of the classes, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, ideally the upper and lower curves would be much closer to the grade in the "true ranking" scenario. Thus, luck or scheduling conflicts have much less of an effect on a students grade. Then the question becomes how to accomplish this. This would mean that sometimes classes would get 40% As and maybe sometimes 15% As, but it would be okay, because this is the grade the students should get.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean having three different test problems on each topic and randomly assigning each student one of these three problems. Apart from the logistic nightmare this would bring about, this would really only work when one lecturer is teaching all the classes. For example, if one of the lecturers is much better than another or likes to do problems close to test problems in lecture, then the students will perform better relative to students in other lectures because of their lecturer. To make this work, there needs to be a way to "factor out" the effect of the lecturer.

Another method would be to treat grading more like high school, and set rigid grade distributions. The tests would then have to be written in a way such that we'd expect the outcome of the test to follow the guideline grade distributions set by the university, assuming the students in the class follow the general student population. Notably the test is not written so that the particular course will follow the guideline grade distribution. Of course this is more work than simply writing a test, and certainly the outcome of a test is hard to estimate. Often I've given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing the instructor did an extremely similar problem as a test problem in class.

One way to implement this would be to look at past tests and look at similar problems, and see how students did on those problems. (Coincidentally, this wasn't possible to do until recently when we started using Gradescope). This gives an idea how we would expect students to perform, and we can use this data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they'll have seen while practicing exams and so it is hard to define how similar a problem is. To do this right requires quite a bit of past data on tests, and as I mentioned earlier this isn't available. Similar problems given by other professors may help, but then we run into the same problem above in that different lecturers will standardize differently from how they decide to teach the course.

Without experimenting with different solutions, it's impossible to figure out what the best solution is, but it seems crazy to accept that curving classes is the best way. Through some work, there could be solutions that encourage student collaboration, reward students for doing their homework (I hope to write more on this in the future) instead of penalizing them for not doing their homework, and take into account how much students are actually learning.

Code for the figure is available here.

Topic Modeling and Gradescope

In this post I'll be looking at trends in exam responses of physics students. I'll be looking at the Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested if things students get right correlate with the physics expectation. For example, I might expect students who were able to apply Gauss's law correctly to be able to apply Ampere's law correctly as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique that is often applied to topic modeling for bodies of text (like the last blog post). The idea of NMF is to take a matrix with positive entries, A, and find matrices W and H, also with positive entries, such that

 A = WH.

Usually, W and H will be chosen to be low rank matrices and the equality above will be approximate. Then, a vector in A is now expressed as the positive linear combination of the small number of rows (topics) of W. This is natural for topic modeling as everything is positive, meaning that cancellations between the rows of W cannot occur.

The data

For each student and for each rubric item, Gradescope stores whether the grader selected that item for the student. Each rubric item has points associated with it, so I use this as the weight for the matrix to perform the NMF on. The problem, though, is that some rubric items correspond to points being taken off from the student, which is not a positive quantity. In this case, I took a lack of being penalized to be the negative of the penalty, and those that were penalized had a 0 entry in that position of the matrix.

There were also 0 point rubric items (we use these mostly as comments that apply to many students). I ignore these entries. But finding a way to incorporate this information could also be interesting.

Once the matrix is constructed, I run NMF on it to get the topic matrix W and the composition matrix H. I look at the entries in W with the highest values, and these are the key ideas in the topic.


The choice of the number of topics (the rank of W and H above) was not obvious. Ideally it would be a small number (like 5) so it would be easy to just read off the main topics. However, this seems to pair together some unrelated ideas by virtue of them being difficult (presumably because the better students did well on these points). Another idea was to look at the error \| A - WH\|_2 and to determine where it flattened out. As apparent below, this analysis suggested that adding more topics after 20 did not help to reduce the error in the factorization.

With 20 topics, it was a pain to look though all of them to determine what each topic represented. Further, some topics were almost identical. One such example was a problem relating to finding the work in an adiabatic process. Using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics. However, one topic had being able to compute the work correctly, as the other one did not. This is probably an indication that the algebra leading up to finding the work was difficult for some. I tried to optimize between these problems and ultimately chose 11 topics, which seems to work reasonably well.

Some "topics" are topics simply by virtue of being worth many points. This would be rubric items with entries such as "completely correct" or "completely incorrect." This tends to hide the finer details that in a problem (e.g. a question testing multiple topics, which is quite common in tests we gave). These topics often had a disproportionate number of points attributed to them. Apart from this, most topics seemed to have roughly the same number of points attributed to them.

Another unexpected feature was that I got a topic that negatively correlated with one's score. This was extremely counter-intuitive as in NMF each topic can only positively contribute to score, so having a significant component in a score necessarily means having a higher score. The reason this component exists is that it captures rubric items that almost everyone gets right. A higher scoring student will get the points in these rubric items from other topics that also contain this rubric item. Most of the other topics had high contributions from rubric items that fewer than 75% of students obtained.

Many topics were contained within a problem, but related concepts across problems did cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. However, it was more common for topics to be entirely contained in a problem.

The exam I analyzed was interesting as we gave the same exam to two groups of students, but had different graders grade the exams (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, but many topics seemed to cluster rubric items slightly differently. Still, the general topics seemed to be quite consistent between the two exams.


The plots show a student's aptitude in a topic as a function of their total exam score for three different topics. Clearly, depending on the topic the behaviors can look quite different.

Looking at topics by the student's overall score has some interesting trends as showed above. As I mentioned before, there are a small number (1 or 2) topics which students with lower scores will "master," but these are just the topics that nearly all of the students get points for. A little over half the topics are ones which students who do well excel at, but where a significant fraction of lower scoring students have trouble with. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may be indicative of misconceptions that students have in approaching these problems. My guess here would be that students did not evaluate an integral to determine the entropy change, but tried to determine it in some other way.

The rest of the topics (2-4) were topics where the distribution of points was relatively unrelated to the total score on the exam. In the example shown above above, the topic was calculating (and determining the right signs) of work in isothermal processes, which is a somewhat involved topic. This seems to indicate that success in this topic is unrelated to understanding the overall material. It is hard to know exactly, but my guess is that these topics test student's ability to do algebra more than their understanding of the material.

I made an attempt to assign a name to each of the topics that were found by analyzing a midterm (ignoring the topic that negatively correlated with score). The result was the following: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every item in each topic fit nicely with these topics. In particular, the rubric items that applied to many students (>90%) would often not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The following spider charts show various student's abilities in each of the topics. I assumed each topic in a concept contributed equally to the concept.


Aptitude in the four main concepts for an excellent student (left) an average student (middle) and a below average student (right).


Since the data is structured to be positive and negative (points can be given or taken off), there may be other matrix decompositions that deal with the data better. In principle, this same analysis could be done using not the matrix of points, but the matrix of boolean (1/0) indicators of rubric items. This would also allow us to take into account the zero point rubric items that were ignored in the analysis. I do not know how this would change the observed results.

I had to manually look through the descriptions of rubric items that applied to each topic and determine what the topic being represented was. An exciting (though challenging) prospect would be to be able to automate this process. This is tricky, though, as associations that S and entropy are the same could be tricky. There may also be insights from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.

Natural Language Processing and Twitch Chat

This post will be about using natural language processing (NLP) to extract information from Twitch chat. I found the chat log of a popular streamer, and I'll be analyzing one day of chats. I'll keep the streamer anonymous just because I didn't ask his permission to analyze his chat.

On this particular day, the streamer had 88807 messages in his chat with 11312 distinct users chatting. This is an average of about 7.9 chats/user. However, this doesn't mean that most people actually post this much. In fact, 4579 (or about 40%) of users only posted one message. This doesn't take into account the people that never posted, but it shows that it is quite common for users to "lurk," or watch the stream without actively taking part in chat. The distribution of posts per user is shown below:

A histogram of the frequency of number of messages in chat (note the log scale). Almost all of the people in chat post less than 10 posts. The chat bots were not included here, so everyone represented in the plot should be an actual user.

Only 1677 (or about 15%) of users posted 10 or more posts in chat, but they accounted for 65284 messages (about 73.5%). This seems to imply that there may be some form of Pareto principle at work here.

What are people talking about?

I used tf-idf on the chat log to get a sense for common words and phrases. The tf in tf-idf stands for word frequency, and is, for each chat message, how many times a certain word appears in that chat message [1]. idf stands for inverse document frequency, and is, for each term, the negative log of the fraction of all messages that the term appears in. The idf is an indicator for how much information there is in a word. Common words like "the" and "a" don't carry much information. tf-idf multiplies the two into one index for each term in each chat message. The words with the highest tf-idf are then the most used words in chat. The following table shows some of the common words in chat


All Chatters           >10 Chat Messages           One Chat Message
lol lol game
kappa kappa lol
kreygasm kreygasm kappa
pogchamp pogchamp kreygasm
game game wtf
myd kkona stream
kkona dansgame followage

The words with highest score under tf-idf for all messages, those who post many messages, and those who only post one message.

Not surprisingly, the people who chat a lot have a similar distribution of words as all the messages (remember, they are about 73.5% of all the messages). Those who only had one message in chat are talking about slightly different things than those who chat a lot. There are a few interesting features, which I will elaborate on below.

myd and followage are bot commands on the streamer's stream. Apparently gimmicks like this are fairly popular, but this means that there are many people chatting without adding content to the stream. It is interesting that those that post more are far less likely to play with these bot commands.

On this day the streamer was playing random games that his subscribers had suggested. This led to weird games and thus many people commented on the game, hence the prevalence of words like "game" and "wtf". People who only post one message seem more likely to complain about the game than those who talk often. For words like this, it could be interesting to see how their prevalence shifts when different games are played.

For those not familiar with Twitch, kappa, kreygasm, pogchamp, kkona, and dansgame are all emotes. Clearly, the use of emotes is quite popular. kkona is an emote on BTTV (a Twitch extension), so it is quite interesting how many people have adopted its use, and this may also indicate why it is more popular with people who post more.

Who do people talk about?

I wanted to see what kind of "conversations" take place in Twitch chat, so I selected for references to other users and then again looked at the most common words under tf-idf. Unfortunately this method will miss many references (e.g. if there were a user who was Nick482392, other people might simply refer to him as Nick) but for an exploratory analysis, it seemed sufficient.

The most referenced person was, predictably, the streamer himself, with 1232 messages mentioning him. The top words for the streamer included "play" with countless suggestions for what other games the streamer should play. During this day, apparently another prominent streamer was talking about the streamer I analyzed, and many people commented on this. There were also many links directed at the streamer. There were no particularly negative words in the most common words directed at the streamer.

I also considered references to other users. There were 4697 of these, though some of these references are simply due to a user having the same name as an emote. Other than the emotes prevalent in general (Kappa, PogChamp), a common word among references was "banned," talking about people who had been banned from talking on the stream by moderators. An interesting thing to look at may also be to look at what kinds of things mods ban for and try to automate some of that process. Another common word was whisper, which was a feature recently added to Twitch. People are at least talking about this feature, which probably means it is getting used as well.


I then looked at all chat messages containing profane words to see if there were trends in how this language was directed. There were 5542 messages that contained profanity, with the most common word being variants of "fuck." The word "game" was often in posts with profanity, which isn't too strange because as mentioned earlier, a lot of people were complaining about the game choice on this day. Other words that were popular in general, such as kappa and kreygasm, were also present in posts with profanity.

The streamer had a visible injury on this day, and there were a few words related to this injury that correlated highly with profanity. These would be messages like "what the hell happened to your arm?" The streamer's name was also quite prevalent in messages that contained profanity.

A little less common than that was a reference to "mods." It seems that people get upset with moderators for banning people and possibly being too harsh. Right below this is "subs," whom there seems to be quite a bit of hostility towards. I'm not sure if this is when subscriber only chat was used, but the use of profanity with "subs" is spread out throughout all of the messages during the day.

There are some profane words that come in bursts (presumably as a reaction to what is happening on the stream). Terms like "sex her" seem to come in bursts, which seems to show some of the more sexist aspects of the Twitch chat ("sex" was a word included as profanity even though it may not qualify as that in all cases).


The ubiquity of emotes on Twitch may be an interesting reason to conduct general NLP research through Twitch chat. Many of these emotes have sentiments or intentions tied to them, and for the most part, people use them for the "right" purpose. For example, Kappa is indicative of sarcasm or someone who is trolling. Sarcasm is notoriously hard for NLP to detect so having a hint like the Kappa emote could reveal general trends in sarcasm [2]. This would be a cool application of machine learning to NLP (maybe a future blog post?).

From a more practical point of view, information like this could be useful to streamers to figure out how they are doing. For example, if a streamer is trying some techniques to get chat more involved, it may be interesting to see if they are successful and they manage to increase the number of chatters with many posts. One thing I didn't consider is how top words change from day-to-day. The game being played and other factors such as recent events may cause these to fluctuate which could be interesting. Of course, more sophisticated analyses can be conducted than looking at top words, for example, looking at the grammar of the messages and seeing what the target of profanity is.

I also just considered one streamer's stream (because I couldn't find many chat logs), and I'm sure it would be interesting to see how other streams differ. The streamer I analyzed is clearly an extremely popular streamer, but it may be interesting to see if the distribution of the engagement level of chatters is different on smaller of streams. It would also be interesting to see if the things said toward female streamers are particularly different than those said to male streamers.

The code I used for this post is available here.

1. Yin, D. et al., 2009. Detection of Harassment on Web 2.0. Proceedings of the Content Analysis in the WEB 2.0 (CAW 2.0) Workshop at WWW2009.
2. Gonzalez-Ibanez, R., Muresan, S., and Wacholder, N., 2011. Identifying Sarcasm in Twitter: A Closer Look . Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2, 581-586.

Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year, and generally appreciate it I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks they use, but I will not get into that here). Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and was often taking the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time when the train arrived on time to Durham, and often the delays were over an hour.

This brings me to the point of this post, which is to answer the question, when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available so that past information is all I have to go off of. Certainly if all the trains are actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would make you miss about this many trains.

Fig. 1: Delay of arrival of the 79 train to the Durham station for each day since 7/4/2006 (with cancelled trains omitted). Note that in the first year and a half of this data, there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all the data available for the 79 train arriving to the Durham station. I've plotted the arrival times during this time in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" where I sample the distribution of arrival times to the Durham station. The particular train I take is just another experiment, and I would expect it to follow the available distribution of arrivals. Thus, the probability of me missing the train if I arrive \tau minutes after the scheduled time is

 p(\tau) = \frac{N(t<\tau)}{N(t\geq 0)}.

Where N(t>t') counts the number of arrivals in the available data where the arrival t' is greater than the specified \tau. The question, then, is how much of the data to include in N(t>t'). To test this, I considered a half year's worth of data as a test set. Then, I figured out how much of the previous data I should use as my training set to most accurately capture the delays in the test set. I found that using a year of data prior to the prediction time worked the best. The method is not perfect; the percentage of missed trains predicted using the training set is about 10% off from the number in the test set, as there are year-to-year variations in the data.

A plot of p(\tau) using all the available data and only using the last year of data is shown in Fig. 2. Using only the last year to build the model, to have an 80% chance of making the train, one can show up about 20 minutes after the scheduled time. This also confirms Amtrak's estimate that their trains are on time 20-30% of the time. Even if one shows up an hour after the scheduled time, though, he or she still has a 36% chance of making the train!

Fig. 2: p(\tau) determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, \delta, given the data of past arrivals, \mathcal{D}. I will call this p(\delta|\mathcal{D}). Suppose I have a model, characterized by a set of n unknown parameters \vec{a} that describes the probability of delay. I will assume all the important information that can be extracted from \mathcal{D} is contained in \vec{a}. Then, I can decompose the delay distribution as

 p(\delta|\mathcal{D}) = \int d^n \vec{a} \;\; p(\delta|\vec{a}) p(\vec{a}|\mathcal{D}).

Using Bayes theorem, p(\vec{a}|\mathcal{D}) can then be expressed as

 p(\vec{a}|\mathcal{D}) = \frac{p(\mathcal{D}|\vec{a})\pi(\vec{a})}{p(\mathcal{D})}.

Here, p(\mathcal{D}|\vec{a}) is the likelihood function (the model evaluated at all of the data points), \pi(\vec{a}) is the prior on the model parameters, and p(\mathcal{D}) is the evidence that serves as a normalization factor.I use non-informative priors for \pi(\vec{a}).

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model \mathcal{M}_1 is f times greater than the evidence for a model \mathcal{M}_2 where

 f = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_1)}.

As the models are assumed to depend on parameters \vec{a} (note that a method that does not explicitly have a functional form, such as a machine learning method, could still be used if p(\mathcal{D}|\mathcal{M}) could be estimated another way)

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) = \int d^n \vec{a} \;\; \prod_{i=1}^N p(\delta_i|\vec{a})\pi(\vec{a}|\mathcal{M}).

Here, \delta_i are all of the delays contained in \mathcal{D}. This integral becomes difficult for large n (even n=3 is getting annoying). To make it more tractable, let l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})), and let \vec{a}^* be the value of the fit parameters that maximize l(\vec{a}). Expanding as a Taylor series gives

 p(\mathcal{D}|\vec{a}) = e^{l(\vec{a})} \approx e^{l(\vec{a}^*)}e^{\frac{1}{2}(\vec{a}-\vec{a}^*)^T H (\vec{a}-\vec{a}^*)}.

where H is the matrix of second derivatives of l(\vec{a}) evaluated at \vec{a}^*. The integral can be evaluated using the Laplace approximation, giving

 p(\mathcal{D}|\mathcal{M}) = \int d^n \vec{a} \;\; p(\mathcal{D}|\vec{a})\pi(\vec{a}|\mathcal{M}) \approx e^{l(\vec{a}^*)} \sqrt{\frac{(2\pi)^n}{\det(-H)}}\pi(\vec{a}^*|\mathcal{M}),

which can now be evaluated by finding a \vec{a}^* that maximizes p(\mathcal{D}|\vec{a}). (Regular priors much be chosen for \pi(\vec{a}^*|\mathcal{M}) since I have to evaluate the prior. I will ignore this point here). I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found under this procedure that the Gamma/Gompertz function described the data the best under this metric. Using this, I explicitly calculate p(\delta|\mathcal{D}), again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.

While this section got a bit technical, it confirms the results of the earlier simple analysis. In particular, this predicts that one should show up about 17 minutes after the scheduled arrival time to ensure that he or she will catch 80% of trains, and that one still has 30% chance of catching the train if one shows up an hour late to the Durham station.

Fig. 3: p(\delta|\mathcal{D}) calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.


Since I downloaded all the data, 5 days have passed and in that time, the delay of the 79 train has been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it seems like the prediction that 80% of the time one would be fine showing up 20 minutes late to the Durham station isn't such a bad guideline.

Of course, both of these models described are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainties is that all days are not equivalent. Certain days will be more likely to have a delay than other. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the model, and for a true model of delays these should either be included or the effect of such systematic fluctuations could be characterized.

Predicting Fires

While organizing a data science workshop this summer, I realized that I hadn't ever written a blog post about the data science project I worked on last year. So, this post will be a summary of what I learned from working on the project.

The problem I worked on was Kaggle's Fire Peril Loss Cost. Given over 300 features, including weather and crime, for over one million insurance policies, we wanted to predict how much the insurance company would lose on a policy due to a fire. This is tricky as fires are rare events, and thus almost all policies have no loss.

We first implemented one-hot encoding to turn the categorical variables into boolean arrays indicating whether a certain category applied to a policy or not. This is better than assigning a number in succession to each category (e.g. Category 1=0, Category 2=1, Category 3=2,...) as doing this will assign a hierarchy/metric on the data, which could create spurious relations. One this encoding was done, all of our data was in a numerical form that was amenable for use in machine learning algorithms.

With so many features on so many policies, the entire dataset would not fit in memory on my laptop. Also, we found that not all 300 features were good predictors of the target value. Thus, we spent time selecting the most important features to make predictions.

There were a manageable amount of categorical variables, so we kept all of these, though we did try removing some of them to see if there was any performance benefit from doing so. For each continuous variable, we tried a model where the prediction was simply the value of the variable. Since the evaluation metric (a weighted gini index) only depended on the ordering of the predictions and not the magnitudes, this analysis method was amenable for all of the continuous variables. Notably, we found that one of the variables (var13) was already a reasonably good predictor of the target. We kept the 30 continuous features that scored best under this metric. We chose this selection method over other more common feature selection methods (such as PCA) to avoid some of the stability issues associated with them, but it may have been interesting, given more time, to see how various feature selection methods fared with one another.

Machine Learning
Since only ~0.3% of policies had any loss, we considered using a classifier to first identify the policies with loss. Ideally, the rate of policies with loss identified by this classifier as not having loss would be sufficiently low. Then a regressor could be used on the policies identified by the classifier as having loss, and then the training set would be less singular. We tried a few classifiers, but did not have much success with this approach.

We then tried to use regressors directly. We tried many of the machine learning regressors available in scikit-learn. We found good results from ridge (Tikhonov) regression and gradient boosted decision trees. In the end, we ended up combining the predictions of the two methods, which will be discussed a bit later.

Ridge regression is similar to standard linear regression, but instead of just minimizing the 2-norm of the vector of residuals, there is a penalty term proportional to the two-norm of the vector of coefficients multiplying the features. This penalty for feature coefficients ensures that no feature coefficients become too large. Large feature coefficients could be a sign of a singular prediction matrix and thus could fluctuate wildly. We found through testing that the optimal constant in front of the 2-norm of the vector of feature coefficients was quite small, so the ridge regression was acting quite similarly to linear regression. Note that with ridge regression it is important that all the features are normalized, as not doing this affects the size of the 2-norm of the vector of feature coefficients.

A decision tree is a set of rules (decisions) used to group policies into different classes. These rules are simple ones such as "is var13>0.5?" These rules are chosen at each step so that they best split the set of items (the subsets should mostly all be of the same values). There are different notions for what best means here, but using the gini impurity or information gain (entropy) are common choices. With enough rules, given a grouping, each policy in the group will all have similar target values. A new policy with an unknown target can then be compared against these rules and the target can be predicted to take on the value of the target in the group of policies it ends up with. Note that the depth of the tree (the number of rules) needs to be limited so that the method does not overfit. One could come up with enough rules such that each policy is its own group, but then the method loses predictive power for a new policy with an unknown target value.

On its own, the decision tree is not great at regression. The power of the method comes from the "gradient boosting" part of the name. After a decision tree is created, there will invariably be some policies that are misclassified. In analogy to gradient descent, the decision tree is then then trained to fit the residuals as the new target variable (or more generally the negative gradient of the loss function). This corrects for errors made at each iteration, and after many iterations, makes for quite a robust regressor and classifier.

We got good performance from each of these methods, and the two methods arrive at the prediction in very different ways. Thus, we combine the results from these two machine learning methods to arrive at our final prediction. We considered a standard mean as well as a geometric mean for the final prediction. We found that the geometric combination was more useful. This seems reasonable in predicting rare events as then both methods have to agree that the prediction value is large to net a large prediction, whereas only one method has to have a small prediction value to net a small prediction.

Other Things
We probably could have dealt with missing entries better. It turned out that many of the features were strongly correlated with other features (some even perfectly) so we could have used this information to try to fill in the missing features. Instead, we filled all missing entries with a value of 0. In general, it's probably best to treat missing features as a systematic error and the effect could be quantified through cross-validating by considering various scenarios of filling in missing entries.

It also turned out that some of the features that were labeled as continuous were not actually continuous and were discrete (there were only a few values that the continuous variable took). There may have been some performance benefit from implementing one-hot encoding on these as well.

For the ridge regression, we could have applied standard model selection methods such as AIC and BIC to choose the key features. For the gradient-boosted decision trees, using these methods is a bit trickier as the complexity of the fit is not easy to determine.

One lesson I learned was the importance of cross-validation. k-fold cross-validation randomly splits up the training set into k subsets. Then, each subset is used as a test set with the complement used as the training set. This gives an idea of how well the model is expected to perform and also how much the model may be expected to overfit. The cross-validation estimate of the error will be an overestimate of the true prediction error, since only a subset of the data is used for prediction, whereas the whole dataset would be used for a true prediction [1]. Ideally, one is in a regime where this difference is not crucial. By adjusting the value of k, a good regime where this is true can be found.

While we did split up our data set into the halves and test by training on one half and predicting on the other, we could have been more careful about the process. In particular, the feature selection process could have been carefully verified. In addition, we trusted our position on the Kaggle leaderboards more than our cross-validation scores, which led our final predictions to overfit to the leaderboard more than we would have liked.

1. Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning.

Pedestrian dynamics

I've written about modeling the movement of cars as a fluid in the past. We could think about pedestrians like this, but usually pedestrians aren't described well by a fluid model. This is because while cars are mostly constrained to move in one direction (in lanes), this is not true of pedestrians. On a sidewalk, people can be walking in opposite directions, and often someone walking in one direction will directly interact (by getting close to) someone walking in the opposite direction. There are some specific scenarios where a fluid model could work, such as a crowd leaving a stadium after the conclusion of a basketball game. In this case, everyone is trying to get away from the stadium so there is some kind of flow. However, this doesn't work generally, so I will consider a different type of model, similar to the one described in [1].

If there are only two directions that people want to travel and they happen to be opposite, then we could model the pedestrians as charged particles. The pedestrians that want to go in opposite directions would be oppositely charged, and the force that keeps the pedestrians on a trajectory could look like an electric field. However, this would mean that people moving in opposite directions would attract each other, which really does not match expectations. This model also fails if there are multiple directions where pedestrians want to go (such as at an intersection), or if the desired directions of the pedestrians are not opposite. While a plasma (which is a collection of charged particles) model may not be the best to describe the scenario, I will borrow some ideas from plasma dynamics in building my model and I will use techniques used to simulate plasmas to simulate pedestrian movement.

There will be a few effects governing the movement of pedestrians. One effect is for the pedestrians to want to go a desired direction at a desired speed. It turns out that most humans walk at a speed of around v_d=1.4 m/s (3.1 mph) and if someone is going slower or faster than this, they will tend to go toward this speed. Let me call the desired speed along the desired direction of the pedestrian \vec{v}_d, and the current walking speed of the pedestrian \vec{v}. I will model the approach to the desired direction as a restorative force that looks like

 \vec{F}_{restore} = \frac{\vec{v}_d-\vec{v}}{\tau}.

Here, \tau represents how long it takes the pedestrian to get back to their desired position and direction once they are off track. In general, \tau could be different for every pedestrian, but for simplicity I set it as a constant for all pedestrians here, and I will take it to 0.3 s, which is close to the human reaction time. Note that the restorative force is zero when \vec{v_d}=\vec{v}, so if a pedestrian is already going in their desired direction at their desired speed, there will be no restorative force and the pedestrian will continue to go at this direction and speed. You may find it odd that my force has units of acceleration. I am thinking about this more as a generalized sense of the term force as in something that causes velocity changes, but it would also be reasonable to assume that I have set the mass of the pedestrians to 1.

Pedestrians will also avoid colliding with each other, which is the other force I include in the model. While [1] assumes an exponential force for the interaction force, I will assume that pedestrians interact via a generalized Coulomb potential. The general results seem to match without too much regard for the exact shape of the force. I define the force between pedestrian i and pedestrian j is

 \vec{F}_{ij} = \gamma \left(\frac{|\vec{r}_{ij}|}{r_0}+\epsilon\right)^{-\alpha}\hat{r}_{ij}.

Where \vec{r}_{ij} = \vec{r}_i - \vec{r}_j. \gamma, \epsilon, \alpha, and r_0 are constants that I will describe below. r_0 is an interaction radius that sets the scale for this interaction. This would not necessarily be the same for everyone. For example if someone is texting, their interaction radius r_0 is probably much smaller than someone who is paying attention to where they are going. However, for simplicity I take it to be the same for everyone, and I take it to have a value of 1.2 m.

Since pedestrians travel in 2 dimensions, if \alpha = 1 and \epsilon = 0, this would be the Coulomb potential, if \gamma were aptly chosen. In this scenario, however, I do not really want the Coulomb potential. The Coulomb potential is quite long range, meaning that particles in a Coulomb potential can influence particles that are quite far away. As the power \alpha in the equation above gets larger, the force becomes more short-range, which seems to better model the interactions of pedestrians. However, this presents another problem in that the force gets extremely large if two pedestrians happen to get really close to one another. To combat this, \epsilon is a small number that "softens" the force such that the force never gets extremely large (which I took to mean |\vec{F}_{ij}| should never be too much bigger than the maximum possible value of |\vec{F}_{restore}|). \gamma then decides the relative importance of this interaction force to the restorative force.

I will simulate this model by considering N people are in a long hallway with aspect ratio 1:10, for example at an airport or a train station. This can also be a model for a long, wide sidewalk as even though there are no walls, people are relatively constrained to stay on the sidewalk. I have some people trying to get to one end of the hallway (in the +\hat{x} direction) and some trying to get to the other end (in the -\hat{x} direction). This is an example of an N-body simulation, which is widely used in studying gravitational systems and plasma systems.

In [1], the walls exerted an exponential force on the pedestrians. I choose a similar model. I set the parameters of the exponential empirically such that the pedestrians keep a reasonable distance from the walls. I set the range of the exponential force to be a tenth of the total width of the corridor. I set the magnitude such that at the maximum, the force due to the wall is the same as the maximum value of |\vec{F}_{restore}|.

When a pedestrian reaches the end of the corridor, I record how much time it took for that pedestrian to traverse the corridor. I then regenerate the pedestrian at the other end of the corridor as a new pedestrian. I generate the new pedestrian with a random y coordinate and a random velocity direction, but pointing at least a small bit in the desired direction. The magnitude of the velocity is taken to be v_d. Thus, the simulation is set up such that there will always be N people in the hallway.

A simulation of N=100 people in a hallway of dimensions 100 m x 10 m. All pedestrians desire to go to the left of the hallway. The pedestrians relax to a state where they are each about the same distance from each other. It seems that people usually stand closer together on average, so our value of r_0 should probably be smaller to match observations.

The first thing I tried was to simply put a few people in the hallway all wanting to go in the same direction, and see what they do. I set the length of the hallway to be 100 m, which made the width of the hallway 10 m. As can be seen above, this isn't too exciting. The pedestrians' paths are mostly unobstructed and they get across the hallway in about 71 s, which is the length of the hallway divided by 1.4 m/s, the comfortable walking speed of the pedestrians. Even in this simple case, though, it is apparent that the pedestrians "relax" into a scenario where the average distance between the pedestrians is roughly the same.

A simulation of N=100 people in a hallway of dimensions 50 m x 5 m. Pedestrians are equally likely to want to go left or right. We can see that lanes of people that would like to go in the same direction can form, as was observed in [2]. This effect could be even stronger with an extra "incentive force" for people to be on the right side of the road if they are not already on that side.

The time required to cross the room as a function of density of people in a simulation of N=100 people. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). p=0.5, which means half of the pedestrians desire to go to the left and the other half desire to go to right. I change the density of people by changing the size of the room from 2.5 m x 25 m to 10m x 100 m. As the density is higher, the pedestrians interact more with each other and thus are less likely to be on their desired trajectory.

Next, I looked at the more interesting cases of what happens when there are pedestrians that want to go in different directions. First, I assume that exactly half of the pedestrians would like to go in one direction and half would like to go the other direction. I then varied the length and width of the hallway, keeping the aspect ratio constant, while keeping the number of people in the hallway constant at 100. This has the effect of changing the density of people in the hallway. The y-axis on the graph above is normalized by L/v_d, which is the time a pedestrian with all of his or her velocity in the desired direction would take. This shows that as the density increases, it takes longer (proportionally) for the pedestrians to get across the room. This makes sense as the pedestrians are interacting more often and thus cannot keep going in the desired direction.

A simulation of N=100 people in a hallway of dimensions 50 m x 5 m. 90% of pedestrians want to go left, while the other 10% want to go right. The right-going pedestrians undergo many interactions with the left-going pedestrians. In fact, if the left-going pedestrians were denser, this could look like Brownian motion.

The time required to cross the room as a function of p, the fraction of N=100 people that would like to go left or right. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). The size of the room is 5 m x 50 m. The blue line is the time to get across for the pedestrians going leftward, and the red line in the time to get across for the pedestrians going rightward. As the fraction of pedestrians going leftward increases, it becomes easier for those pedestrians to get across, but it makes it harder for the pedestrians that would like to go in the opposite direction to get across more slowly.

I then took the number of people in the hallway to be 100 with the length of the hallway being 50 m and the width being 5 m. I observed what happened as I varied the fraction of pedestrians, p that wanted to go in either direction. This effect is shown above. As p is increased, the more dominant pedestrians can get through the corridor more quickly than the less dominant pedestrians. Again, this makes sense as when people go "against the gradient," they have to weave through people to try to get to the other side.

I will note that I have not done this simulation in the most efficient way. For every pedestrian, I calculate the interaction force with all the other pedestrians and add up all the contributions. It turns out one can average or sometimes even ignore the effect of pedestrians far away, which can make the code run about 1/N times faster.

The python and gnuplot scripts I used for the simulation and to create the plots are available here.

1. Kwak, J., 2014. Modeling Pedestrian Switching Behavior for Attractions. Transportation Research Procedia. 2. 612-617.
2. Tao, X., 2011. A Macroscopic Approach to the Lane Formation Phenomenon in Pedestrian Counterflow. Chinese Phys. Lett. 28.

Ising model and voting patterns

The Ising model is a ubiquitous model in condensed matter physics as it is a simple model with nontrivial behavior. It is one of the few systems that has analytic solutions, though this only holds for the Ising model in low dimensions. Despite its simplicity, it does quite well at describing the ferromagnetic systems.

While the Ising model is typically applied to ferromagnetic spin systems, even within physics it can also be used to describe a gas that lives on a lattice. Here, I will show how the model can be extended outside of physics and describe how the Ising model relates to voting.

Assume there are N people that will vote in an election. For simplicity, assume there are just two candidates in the election (or the vote is for a yes/no proposition). Let the state of voter i be characterized by a variable s_i that take the value of +1 or -1, corresponding to voting for one of the two outcomes of the election. The assumption is that each voter will interact with other voters and convince them that they should vote a certain way. Let the energy associated with a voter agreeing with someone they interact with be -\epsilon and the energy associated with a voter disagreeing with someone they interact with be +\epsilon.

For each person, there will be a propensity to vote a certain way, such as being affiliated with a political party or being subjected to advertising campaigns. Let the energy of this propensity be called h_i for person number i. Let h_i be measured on a scale of -\beta (always voting for candidate -1) to +\beta (always voting for candidate +1). If a voter has a 0 value for this propensity, they are a "swing voter" and can be convinced by his or her neighbors to vote for either the +1 or -1 candidate. In a physical system, this propensity is equivalent to the existence of an external magnetic field in the spin system. Putting all this together, the total energy among the N voters is

 E = -\epsilon \sum_{<ij>} s_i s_j+\sum_i h_i s_i.

The angled brackets here mean only take the sum if i and j are able to interact. This is in here as clearly all N voters will not talk to every one of the others unless N is very small. This equation is precisely the equation for the Ising model. The equilibrium state will be one when this energy is minimized (though this minimum energy state is usually not unique). Now suppose all of the voters are on a 2D lattice (or grid), meaning that each voter interacts with 4 other voters. In the \beta=0 case (no political parties), then an analytic solution exists.


Configuration of voters on a 2D lattice with \beta=0. Black are voters for candidate +1 and white are voters for candidate -1. The left is the case where \frac{\epsilon}{T} = 0.5, the right is where \frac{\epsilon}{T} = 0.33. There is clearly clustering of similarly minded people on the left, but there is no such structure on the right. This is an example of a phase transition.

As shown above, in this case, depending on whether the "social temperature" is above or below a critical temperature, the voters will have random opinions or will form large regions of agreement. Note that as before, this temperature does not correspond to how hot it is where the voters live, but a measure of how nonideal voters are willing to be with regards to the model. If the temperature is high, random fluctuations dominate and so there is no correlation among voters. However, if the temperature is low, the voters tend to form regions where they agree with others. The critical temperature can depend on the issue at hand. I tend to agree with most of the people I associate with, so in my experience it seems that voting systems are usually in the low energy limit. However, there are some issues that are not as clear (such as GMO labeling) where the voting system can be in the high energy limit.

Now let me consider the effect of political parties. This case cannot be solved analytically, so the only way to solve the problem is numerically. The best way to do this is using the Metropolis algorithm. In this method, we first assign randomly to all voters which candidate they will vote for (give them +1 or -1). We then choose a voter at random and consider the energy change \Delta E if we changed that voter's vote. If switching the vote will decrease the total energy of the system, then the voter's vote is changed. If not, the voter's vote is changed with probability e^{-\Delta E/T}, where T is the temperature of the system. If T=0, this second condition never occurs and thus a voter's opinion is changed only if it decreases the total energy. Thus, as mentioned earlier, a nonzero temperature has the effect of measuring how nonideal voters are willing to be which corresponds to increasing the energy. This probability is called a Boltzmann factor and is ubiquitous in statistical physics. After this is done, another voter is chosen randomly and the process is repeated many times. After a long time, this algorithm produces a state that is typical of an equilibrium configuration of the voters.

I assume my voters live on a 2D lattice. This means that each voter interacts with 4 other voters and no one else. One reason I choose this is that it is particularly easy to visualize. I chose the number of voters to be 62500. I repeated the Metropolis algorithm 10^7 times, so that every voter will probably be considered by the algorithm many times. I took periodic boundary conditions such that voters on the edge interact with voters on the other edge. This insured that every voter interacted with 4 other voters.

Configuration of voters on a 2D lattice with \frac{\epsilon}{T} = 0.5 and \frac{\beta}{T}=2. The h_i values for each voter were assigned randomly between -\beta and +\beta. Black are voters for candidate +1 and white are voters for candidate -1. Unlike the \beta=0 case, there is now less clustering of like-minded voters.

The first thing I tried is shown above. I randomly assigning values of h_i between -\beta and +\beta to each of the voters. I still assign randomly the initial configuration of votes. This has the effect of adding more disorder to the voters, and can prevent clusters of similarly minded voters from forming, even at low temperatures.


Configuration of voters on a 2D lattice with \frac{\beta}{T}=0.5, with \beta distributed from negative to positive linearly from one end of the lattice to the other. Black are voters for candidate +1 and white are voters for candidate -1. The left is the case where \frac{\epsilon}{T} = 0.5, the right is where \frac{\epsilon}{T} = 0.33. There is a tendency in both images for votes to cluster toward one side.

This, however, doesn't seem too realistic, as like-minded people usually associate with each other. Because of this, I tried adjusting h_i such that it varies linearly from one edge of the lattice to the other. This has the effect of encouraging order, and predictably groups similarly minded people (when \beta=0, the choice of grouping is random and is an example of spontaneous symmetry breaking).

When the choice of h_i is such that positive and negative values are equally likely, there is no clear winner in the election (the difference in +1 and -1 votes is usually very close to 0). However, if there is an asymmetry between the number of people who can have negative h_i values and positive h_i values, this is usually a good indicator of who will win the election. This asymmetry could be due to something as simple as one campaign having better ads than the others and getting the swing voters.

One extension that could be made is to have different weights for certain interactions. Reading a distant friend's post on Facebook is probably less relevant to you than a discussion with a close family member. To account for this, the value of \epsilon could be adjusted such that it takes on different values for different pairs of interactions. Another possible extension is to consider different interaction schemes. It seems unlikely that every voter would have only four people they interact with. We could do something more realistic such as having a range of a number of interactions each voter can have. This is a little more complicated to program, however, as a different data structure (a graph) has to be used.

You can see the code I used here.

1. Schroeder, Daniel. 2000. An Introduction to Thermal Physics.

City Population and the Fokker-Planck equation

Zipf's law states that many types of data, when sorted, look like a Pareto distribution and when plotted on a log-log scale look like a straight line. In other words, many features tend to have power law behavior. This is simply described by the Pareto principle which states that often a small percentage of causes account for a large percentage of effects. This is exemplified by the 99% slogan used by the Occupy movement stating that the top 1% of people in the U.S. earned most of the income.

Zipf's law seems to be quite ubiquitous. Examples where similar trends have been observed are the number of crimes criminals commit, the amount of healthcare patients receive, the amount of complaints customers give, and many others [1]. This article is about the distribution of people across cities, which also follows Zipf's law.

[2] and [3] look at Zipf's law for wealth distribution. Here, I'll present the same basic argument they do, but in terms of population, as this seems a little more intuitive to me.

The above is a plot of the population of U.S. cities with more than 50,000 people by their rank in size. The data is from [4]. The orange line is the linear (on a log-log scale) fit. As you can see, the line fits the data quite well, especially toward the smaller cities. The slope of the line is about 0.725.

Now I will consider the mathematical model. Say there are \mathcal{N} cities. The only assumption that I will make is that people can move between two cities following a simple rule. I will assume that two cities can move some (presumably small) constant \beta times the minimum of the population of the two cities. I assume that the flow is random, such that this movement can occur either way.

Now assume two cities have population n and n'. The change, \Delta, in population of the city with population n after someone moves is

 \Delta = \beta r \min(n,n').

The change in population of the city with population n' is -\Delta. r is a random variable that has a 50% chance of being +1 and a 50% chance of being -1. The PDF, P(n,t) of a system undergoing a Markovian process of small steps like this is given by the Fokker-Planck equation.

 \frac{\partial P}{\partial t} = - \frac{\partial}{\partial n}\left(<\Delta>P\right)+\frac{\partial^2}{\partial n^2}\left(\frac{<\Delta \Delta>}{2}P\right) = \beta^2 \frac{\partial^2}{\partial n^2}\left(\left(\frac{n^2}{2}A(n)+B(n)\right)P\right)


 A(n) = \frac{1}{\mathcal{N}} \int_n^\infty dn' P(n'),

is the Pareto function and

 B(n) = \frac{1}{\mathcal{N}} \int_0^n dn' P(n')\frac{{n'}^2}{2}.

The angled brackets here represent expectation values. For a derivation of this relation, see [3]. In plasma physics, if the plasma is sufficiently dense, the constituents of the plasma will have many interactions with one another. If one particle in the plasma is looked at, the path looks like a simple random walk, which is a Markovian process. Thus, the PDF of the plasma will obey what looks like a Fokker-Planck equation. Similar arguments can be used to obtain a similar relation for the PDFs of particles undergoing Brownian motion. The Fokker-Planck equation has even found uses in mathematical finance.

One thing to note is that the value of \beta cannot change the dynamics. I can change my units of time to absorb \beta, and so nothing critical should depend on it (and I do not include \beta when I do the numerical integration). The PDE is not easy to solve analytically as it mixes up derivatives and integrals of the PDF P(n,t). Thus, I solve it numerically by assuming an initial distribution for P(n,0) and letting it evolve in time until it seems like the PDF has reached a steady state. The initial distribution I assumed was exponential.

The population distribution of the model. Given the blue initial conditions, the system evolves to the purple line which looks linear over a large number of ranks. The straight orange line has slope 1.05. I believe that the kink that develops is a numerical effect and could be reconciled if the step size in space were smaller. However, because of the CFL condition this requires an even smaller time step and thus would make the code take an even longer amount of time.

Here, it is apparent that the PDF approaches a straight line in the log-log plot, as was apparent in the U.S. census data. It turns out that the choice of the initial condition does not matter much. In [2] the authors choose a much different initial condition but obtain a similar steady state solution after a long time.

One key difference between the result of this model and the U.S. population is that the slope of the lines are quite different. In the plot above, the slope of the line is about 1.05, while the U.S. population has a slope of about 0.725.

One way to modify the model to potentially reconcile this is to allow for moving to and from a city to have different probabilities. Intuitively, I'd expect more people would want to go from a smaller city to a bigger city rather than the other way around. It turns out this effect is pretty easy to incorporate. In the relation above for \Delta, I change r such that it is a random variable that has an a chance of being +1 and a 1-a chance of being -1. Before, I took a=1/2.

In this case, <\Delta> is no longer zero but the other term is exactly the same. The expression becomes

 \frac{\partial P}{\partial t} = - \beta(2a-1)\frac{\partial}{\partial n}\left((C(n)+nA(n))P\right) + \beta^2 \frac{\partial^2}{\partial n^2}\left(\left(\frac{n^2}{2}A(n)+B(n)\right)P\right)


 C(n) = \frac{1}{\mathcal{N}} \int_n^\infty dn' P(n')n',

Note that now \beta is almost a relevant parameter as it can no longer be completely absorbed into the definition of time. The relevant parameter here that will control the dynamics will be \frac{2a-1}{\beta}.

The population distribution of the modified model with \frac{2a-1}{\beta} = 0.2. Given the blue initial conditions, the system evolves to the purple line which looks linear over a large number of ranks. The straight orange line has slope 0.8.

I solved this numerically in the same manner as before. In the above plot, the slope of the line is about 0.8, which is much closer to the population data. All this required was for moving to a big city to be about 10% more desirable than moving away from the city (though the exact percentage depends on the relationship between a and \beta). I'm sure by tuning the ratio \frac{2a-1}{\beta} further I could get it even closer to the 0.725 slope observed in the U.S. population.

There are also other factors that could be considered, such as the birth and death of city residents as well as immigration. [2] describes how to take these into account, but for the sake of simplicity I will not consider it here. An interesting aspect is that including just one of these cannot change the behavior (other than the size of the population increasing or decreasing), though adding both can lead to different behavior.

As I was writing this, I was trying to think whether something like crime could be modeled as a similar way, but I couldn't think of a reasonable exchange (like the \Delta expression) that could be written. It would be interesting to look into how Zipf's law is emergent in other systems as well.

The code is available here. The code runs pretty slowly, it has not been optimized. The easiest optimization would probably be to do the computations on a GPU using pyCUDA. I'll leave that as an exercise to the reader ;).

1. Reed, William J.; et al. 2004. The Double Pareto-Lognormal Distribution – A New Parametric Model for Size Distributions. Communications in Statistics – Theory and Methods 33 (8), 1733–1753.
2. Boghosian, Bruce M. 2014. Kinetics of wealth and the Pareto law. Physical Review E 89, 042804.
3. Boghosian, Bruce M. 2014. Fokker-Planck Description of Wealth Dynamics and the Origin of Pareto’s Law. arXiv.