# Who will win Top Chef Season 15?

Warning: Spoilers ahead if you have not seen the first three episodes of the new season

Since I was very wrong last year about who would win Top Chef, I decided to take another stab at it.

This time, I approached it slightly differently. I predicted the win probability after each episode rather than each chef's ultimate placement. I did this since there are not that many relevant features to predict the placement. When I was validating the model on season 14, I found that the model was overfitting when using the location features (e.g., from California), so I decided to leave these features out as well.

The other big change was I decided to use an RNN model to process the sequence of episode performance for each chef. This way, I could train the model on all data, not just data from the first episode, but still, gain useful insights for the first episode.

The output of the RNN was then combined with features for gender, and whether the season had last chance kitchen to get the final prediction. I validated this with season 14. I tuned the training parameters to try to give Brooke the best chance of winning, though it was still challenging as she was eliminated.

Unfortunately, even with the last chance kitchen feature, the model is having a hard time accounting for chefs coming back. This isn't too surprising considering Kristen is the only one who won last chance kitchen and ended up winning the whole thing (until season 14). This is less relevant for predictions after the first two episodes, as a chef eliminated in the first two episodes is extremely unlikely to come back through last chance kitchen. The predictions after the first episode aren't too different from last year's model. Casey is the favorite, though Sheldon comes in as second instead of Katsuji.

After choosing optimal parameters using season 14, I use all of seasons 1-14 to train a model to predict on season 15. With this, after the first episode, I get the following win probabilities for each chef.

As expected, the winner of the first challenge (Tyler) is the favorite to win. I was delighted to see Tu come in second since I've been to his amazing popups and would love to see him win.

It's taken me a while to actually write this post, and since then, Claudette, Laura, and Rogelio have been eliminated, who were all in the 4% chance of winning crowd, so the model seems to be holding up so far!

# What does Kappa mean?

I've written about some basic NLP on twitch chats before. This post is an extension of that, with more data, more sophisticated methods, and hopefully better results!

Word2vec is an algorithm that takes a one-hot encoding of words, which is a sparse, high dimensional representation, and maps it to a dense, low dimensional representation. In this low dimensional word embedding, the distance metric between words represents how likely words are to appear near each other in a sentence. This results in synonyms and related words being close to each other. In fact, misspelled words (which are common in twitch chats) are often the closest synonym for words. The low dimensional representation is learned by training a shallow neural network to predict a word given the words around it. This has some intriguing properties like being good at analogies (which will be discussed later).

So, I took a bunch of twitch chats I could find and trained a word2vec model on it. This was from chats of about 360 streamers over the past four years. Unfortunately, this isn't the most unbiased source of data. Clearly, larger streams have more chatters, and thus their chats will be overrepresented. In addition, the 360 streamers are a small fraction of all streamers on twitch. In fact, none of the streamers I regularly watch had available chat logs.

I did some cleaning and took out single word messages as well as anything twitchnotify and bots said ¹. Even if a chat is actually multiple sentences, I assume that each chat message is a sentence for the word2vec training. I also strip all symbols in messages to ignore any punctuation. This does have the downside of making words such as <3 and 3 equivalent. I chose a context size of 2, and require a word to appear more than 50 times in the corpus to train the word2vec model. I train a 200-dimensional word2vec model.

Once the word2vec is trained, the cosine distance between word vectors can be used to determine their similarity. This showed that the word closest to Kappa was, unsurprisingly, Keepo. This is followed by 4Head and DansGame; also twitch emotes. The closest non-emote word closest to Kappa was lol. This was unsatisfying for me because I feel like information is lost in translation from equating Kappa with lol, but it makes sense. Kappa is likely to appear at the end of sarcastic sentences, and it's quite reasonable for lol to occur in a similar context.

I then looked into analogies. Word2vec has a cool property that by adding and subtracting word vectors, a relationship between two words can be added to another word. The textbook example of this is man+queen-woman=king. That is, the relationship of the difference between queen and woman (royalty) is being added to man to get king. My word2vec model did, in fact, learn this relationship. With some of the game-related analogies, the model fares a bit worse. The expected analogy result was not necessary the closest vector to the vector sums but would be one of the closest.

The top three closest word vectors to the vector sum (or difference) of word vectors shown. This shows that the models may not learn relationships between words entirely, but is developing a pretty good idea.

Next, I plotted how words were distributed globally in the word embedding. I used PCA to reduce the 200-dimensional word embedding to 2 dimensions to visualize the relationships. What this showed was that foreign words cluster separately from English words. This makes sense, as it should be rare to combine German words with English ones in the same sentence. Another effect was the commonly used words clustered together, and then there is a region of context-specific words and meme words. Sub emotes are an example of context-specific words, as they are likely to be found only in the chat of one streamer, in which similar chat topics are present. A meme word would be something like the "word" AAAAE-A-A-I-A-U- that usually only appears with the Brain Power meme, and are unlikely to show up in any other context.

The word embedding of all the words in the corpus, with PCA used to reduce the dimensions from 200 to 2. Each dot represents a word. Natural clusters form in the word embedding.

Zooming into the common words area, the relationships between words become apparent. Most of the global emotes are toward the lower half of the common words range, while games sit on the top half. TriHard is closer to the left, getting close to the context-specific range, which makes sense as while TriHard is a global emote, it's probably used more in TriHex's chat. The politicians cluster together, with Obama closer to Clinton than Sanders or Trump.

Zooming into the common words part of the previous graph, and visualizing some of the words.

With the success of vector representation of words, a natural extension is a vector representation of chat messages. This can be a useful way to classify similar sentiments or intents between chatters. A simple way to get a vector representation of sentences is to take an average of the word vectors in the sentence in a bag of words approach, ignoring any words that do not have a vector representation. This is a commutative operation, so it is not a perfect representation of a sentence. For example, the sentences "the cat big the dog" and "the dog bit the cat" have different meanings.  However, this is a good starting point to capture the overall intent of sentences.

I cluster the sentences to determine relationships between them. I sampled a set of 1,000,000 chat messages as this was about as much my computer could handle. Typically, I would use DBSCAN to cluster, but as this was computationally prohibitive, I opted for minibatch k-means. I chose the cluster size as 15, and 6 large clusters emerged from the clustering, shown below.

The sentence vectors of 1,000,000 chat messages, with PCA used to reduce the dimensions from 200 to 2. Each dot now represents a chat message, calculated as a sum of the words vectors in the chat message. The different colors represent clusters found by K-means.

Like the words, sentences containing foreign phrases cluster separately from the rest of sentences. Likewise, chats with sub emotes, and channel specific memes tend to cluster together. Spamming of global emotes was another cluster, and there is reasonably another cluster where global emotes are combined with text chat messages. Chat messages without emotes tend to cluster into two regions: one where the chatter is interacting with the streamer, and another where the chatter is talking about themselves or referencing others in chat (that is the personal pronouns category). These are general trends, and six clusters are not enough to capture all intents of chatters, but this gives a broad idea.

As mentioned earlier, in the context of twitch, this wasn't using that much available data. I'd expect that training on more data might help the model better learn some of the analogies. Another intriguing prospect is how the word embeddings change. I'm sure relationships between words like Clinton and Trump evolved over the course of last year's election. This raises interesting questions about what time period that word2vec should use as a training corpus.

Code for this post is available here.

1. I define bots as tags containing the string 'bot.'

# Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really I read almost every article written), and I think they do a great job of interpreting poll results. However, I do notice that the process in which they compile the results is tedious. Some of the tediousness, like collecting poll results, is inevitable. However, other work including figuring out which polls are reliable and how to interpret polls across different states seems like it might be automatable. I took a stab at this automation using data that fivethirtyeight compiled, which has gathered polling data for various state and national elections three weeks before the election date. This was about 6200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but what kind of moving average taken is up for debate.
2) How to decide which polls are good. Dishonest polls are a problem. If a pollster is partisan, there needs to be a way to take into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll if it disagrees with the conventional wisdom, this can introduce bias. There is also uncertainty about how undecided voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue was tricky, and will not be covered here, but the first three issues are discussed in this post.

I tackle the problem as a time series prediction problem. That is, given a time series (when the polls happen and their results) I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to solve the problem. RNNs even handle different-length sequences quite naturally, which is a plus as this is awkward to encode into useful input for other types of models like a tree-based model.

I use the data before 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, instead of whole elections (or the entire presidential race, which is equally complex). For each poll, I compile a set of information about the polls:
• How each candidate in the race polled, in order of democrat, republican, and highest polling third party candidate, if available. The ordering is relevant as I account for the political preferences of polls.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• Of polls conducted before the poll, the percentile of the number of polls the pollster has conducted relative to all other pollsters. The intuition here is that agencies that do one poll are not very reliable, whereas agencies that have done many polls probably are.

All of this information is collected into a single vector, which I will call the poll information. I then take all polls of the same race previous to that poll and make an ordered list of the poll informations, which is the sequence that is the input to the neural network model.

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any sense of "year" as input to the neural network as I wish the model to extrapolate on this variable and hence I do not want the model to overfit to any trends there may be in this variable. I use three LSTM layers with 256 units followed by two fully connected layers with 512 neurons. The output layer is one neuron that predicts the final margin of the election. Mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initialize by transfer learning from an exponentially weighted moving average.

I use dropout at time of prediction as a way to get an estimate of the error in the output of the model. The range where 90% of predictions lie using different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on a training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 election). I find the percentile of the ground truth election result within the Monte Carlo model predictions. Thus, a perfectly calibrated model would have a uniform distribution of the percentile of ground truth election results within the Monte Carlo model predictions. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic with the uniform distribution. This turned out to be 40%, which was comforting as this is a typical choice for dropout at training.

A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data before 2016 (thus using both the training and validation set). I take the calibrated dropout rate and again get Monte Carlo samples for polls, using the newly trained model, on the test set. I then count the number of positive-margin Monte Carlo outputs to obtain a probability that the election swings in favor of the first party.

Applying this to a Senate race, I find that the uncertainty in election margin is sizable, around 15%. This is comparable to uncertainties obtained through other aggregation methods, but this shows that it is tough to accurately call close races just based on polls.

Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available that day, for each state I run 1000 predictions with different RNG seeds. For each of these 1000 predictions, I add up the electoral votes the candidate would win if they had the predicted margins. The probability of the candidate winning is then the percentage of these outcomes that is below 270.

Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republic victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.

The probability of Clinton winning the 2016 election predicted by the model as a function of days before the election.

Unless the 2016 election was a rare event, this shows that clearly, the model is incomplete. Relationships between how states vote compared to polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively, and in particular, better discern which pollsters are reliable. More features, such as whether the incumbent president is running or if it is an off-year election may also add more information in the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.

1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or more than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the primary contributor to the KS test statistic, so it is ignored here.

# Who will win Top Chef Season 14?

Warning: Spoilers ahead if you have not seen the first two episodes of the new season

In the first episode of the season Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has won the first quickfire and gone on to be top chef (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9 when a whole team won the elimination challenge). This got me wondering if there were other predictors as to who would win Top Chef.

There's not too much data after the first elimination challenge, but I tried building a predictive model using the chef's gender, age, quickfire and elimination performance, and current residence (though I ultimately selected the most predictive features from the list). I used this data as features with a target variable of elimination number to build a gradient-boosted decision tree model to predict when the chefs this season would be eliminated. I validated the model with seasons 12 and 13 and then applied the model to season 14. I looked at the total distance between the predicted and actual placings of the contestants as the metric to optimize during validation. The model predicted both of these seasons correctly, but seasons 12 and 13 were two seasons where the winner of the first elimination challenge became top chef.

The most significant features in predicting the winner were: elimination challenge 1 performance, season (catching general trends across seasons), gender, home state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Chicago. To try to visualize this better, I used these significant features and performed a PCA to plot the data in two dimensions. This shows how data clusters, without any knowledge of the ultimate placement of the contestants.

A plot of the PCA components using the key identified features. The colors represent the ultimate position of the contestants. Blue represents more successful contestants where red represents less successful contestants. The $x$ direction corresponds mostly to first elimination success (with more successful contestants on the right), and the $y$ direction corresponds primarily to gender (with male on top). The smaller spreads indicate the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is an absolute deal-breaker to become top chef, but of course, winning the first challenge puts contestants in a better position.

My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.

Of course, there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about adding but didn't was what the chefs actually cooked. Specific ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g., season 11 was probably the only time turtle was cooked in an elimination challenge). With more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.

The code is available here.

I have been thinking a lot about teaching lately (maybe now that I will not be teaching anymore) and I hope to write a series of a few blog posts about it. My first post here will be on grade inflation, specifically whether curving is an effective way to combat it.

A popular method to combat grade inflation seems to be to impose a set curve for all classes. That is, for example, the top 25% of students get As, the next 35% get Bs and the bottom 40% gets Cs, Ds, and Fs (which is the guideline for my class). While this necessarily avoids the problem of too many people getting As, it can be a bit too rigid, which I will show below.

In the class I teach, there are ~350 students, who are spread among three lectures. I will investigate what effect the splitting of students into the lectures has on their grade. First, I will make an incredibly simple model where I assume there is a "true" ranking of all the students. That is, if all the students were actually in one big class, this would be the ordering of their grades in the course. I will assume that the assessments given to the students in the classes they end up in are completely fair. That is, if their "true" ranking is the highest of anyone in the class, they will get the highest grade in the class and if their "true" ranking is the second highest of anyone in the class they will get the second highest grade and so on. I then assign students randomly to three classes and see how much their percentile in the class fluctuates based on these random choices. This is shown below

The straight black line shows the percentile a student would have gotten had the students been in one large lecture. The black curve above and below it shows the 90% variability in percentile due to random assignment.

We see that even random assignment can cause significant fluctuations, and creates variability particularly for the students in the "middle of the pack." Most students apart from those at the top and bottom could have their letter grade change by a third of a letter grade just due to how the classes were chosen.

Further, this was just assuming the assignment was random. Often, the 8 am lecture has more freshman because they register last and lectures at better times are likely to fill up. There may also be a class that advanced students would sign up for that conflicts with one of the lecture times of my course. This would cause these advanced students to prefer taking the lectures at other times. These effects would only make the story worse from what's shown above.

We have also assumed that each class has the ability to perfectly rank the students from best to lowest. Unfortunately, there is variability in how exam problems are graded and how good questions are at distinguishing students, and so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything or not. Since the grading is based on a way to rank students, even if a lecturer is ineffective and thus the students in the course don't learn very much, the student's score will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but it seems like any top university should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). In particular, an issue here is that how much of the course material students learned is an extremely hard factor to quantify without imposing standards. However, standardized testing leads to ineffective teaching methods (and teaching "to the test") and is clearly not the answer. I'm not aware of a good way to solve this problem, but I think taking data-driven approaches to study this would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages for each of the classes, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, ideally the upper and lower curves would be much closer to the grade in the "true ranking" scenario. Thus, luck or scheduling conflicts have much less of an effect on a students grade. Then the question becomes how to accomplish this. This would mean that sometimes classes would get 40% As and maybe sometimes 15% As, but it would be okay because this is the grade the students should get.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean having three different test problems on each topic and randomly assigning each student one of these three problems. Apart from the logistic nightmare this would bring about, this would really only work when one lecturer is teaching all the classes. For example, if one of the lecturers is much better than another or likes to do problems close to test problems in lecture, then the students will perform better relative to students in other lectures because of their lecturer. To make this work, there needs to be a way to "factor out" the effect of the lecturer.

Another method would be to treat grading more like high school and set rigid grade distributions. The tests would then have to be written in a way such that we'd expect the outcome of the test to follow the guideline grade distributions set by the university, assuming the students in the class follow the general student population. Notably, the test is not written so that the particular course will follow the guideline grade distribution. Of course this is more work than simply writing a test, and certainly, the outcome of a test is hard to estimate. Often I've given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing the instructor did an extremely similar problem as a test problem in class.

One way to implement this would be to look at past tests and look at similar problems, and see how students did on those problems. (Coincidentally, this wasn't possible to do until recently when we started using Gradescope). This gives an idea how we would expect students to perform, and we can use this data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they'll have seen while practicing exams and so it is hard to define how similar a problem is. To do this right requires quite a bit of past data on tests, and as I mentioned earlier this isn't available. Similar problems given by other professors may help, but then we run into the same problem above in that different lecturers will standardize differently from how they decide to teach the course.

Without experimenting with different solutions, it's impossible to figure out what the best solution is, but it seems crazy to accept that curving classes is the best way. Through some work, there could be solutions that encourage student collaboration, reward students for doing their homework (I hope to write more on this in the future) instead of penalizing them for not doing their homework, and take into account how much students are actually learning.

Code for the figure is available here.

# Natural Language Processing and Twitch Chat

This post will be about using natural language processing (NLP) to extract information from Twitch chat. I found the chat log of a popular streamer, and I'll be analyzing one day of chats. I'll keep the streamer anonymous just because I didn't ask his permission to analyze his chat.

On this particular day, the streamer had 88807 messages in his chat with 11312 distinct users chatting. This is an average of about 7.9 chats/user. However, this doesn't mean that most people actually post this much. In fact, 4579 (or about 40%) of users only posted one message. This doesn't take into account the people that never posted, but it shows that it is quite common for users to "lurk," or watch the stream without actively taking part in chat. The distribution of posts per user is shown below:

A histogram of the frequency of number of messages in chat (note the log scale). Almost all of the people in chat post less than 10 posts. The chat bots were not included here, so everyone represented in the plot should be an actual user.

Only 1677 (or about 15%) of users posted 10 or more posts in chat, but they accounted for 65284 messages (about 73.5%). This seems to imply that there may be some form of Pareto principle at work here.

I used tf-idf on the chat log to get a sense for common words and phrases. The tf in tf-idf stands for word frequency, and is, for each chat message, how many times a certain word appears in that chat message [1]. idf stands for inverse document frequency, and is, for each term, the negative log of the fraction of all messages that the term appears in. The idf is an indicator for how much information there is in a word. Common words like "the" and "a" don't carry much information. tf-idf multiplies the two into one index for each term in each chat message. The words with the highest tf-idf are then the most used words in chat. The following table shows some of the common words in chat

 All Chatters >10 Chat Messages One Chat Message lol lol game kappa kappa lol kreygasm kreygasm kappa pogchamp pogchamp kreygasm game game wtf myd kkona stream kkona dansgame followage

The words with highest score under tf-idf for all messages, those who post many messages, and those who only post one message.

Not surprisingly, the people who chat a lot have a similar distribution of words as all the messages (remember, they are about 73.5% of all the messages). Those who only had one message in chat are talking about slightly different things than those who chat a lot. There are a few interesting features, which I will elaborate on below.

myd and followage are bot commands on the streamer's stream. Apparently gimmicks like this are fairly popular, but this means that there are many people chatting without adding content to the stream. It is interesting that those that post more are far less likely to play with these bot commands.

On this day the streamer was playing random games that his subscribers had suggested. This led to weird games and thus many people commented on the game, hence the prevalence of words like "game" and "wtf". People who only post one message seem more likely to complain about the game than those who talk often. For words like this, it could be interesting to see how their prevalence shifts when different games are played.

For those not familiar with Twitch, kappa, kreygasm, pogchamp, kkona, and dansgame are all emotes. Clearly, the use of emotes is quite popular. kkona is an emote on BTTV (a Twitch extension), so it is quite interesting how many people have adopted its use, and this may also indicate why it is more popular with people who post more.

I wanted to see what kind of "conversations" take place in Twitch chat, so I selected for references to other users and then again looked at the most common words under tf-idf. Unfortunately this method will miss many references (e.g. if there were a user who was Nick482392, other people might simply refer to him as Nick) but for an exploratory analysis, it seemed sufficient.

The most referenced person was, predictably, the streamer himself, with 1232 messages mentioning him. The top words for the streamer included "play" with countless suggestions for what other games the streamer should play. During this day, apparently another prominent streamer was talking about the streamer I analyzed, and many people commented on this. There were also many links directed at the streamer. There were no particularly negative words in the most common words directed at the streamer.

I also considered references to other users. There were 4697 of these, though some of these references are simply due to a user having the same name as an emote. Other than the emotes prevalent in general (Kappa, PogChamp), a common word among references was "banned," talking about people who had been banned from talking on the stream by moderators. An interesting thing to look at may also be to look at what kinds of things mods ban for and try to automate some of that process. Another common word was whisper, which was a feature recently added to Twitch. People are at least talking about this feature, which probably means it is getting used as well.

Profanity?

I then looked at all chat messages containing profane words to see if there were trends in how this language was directed. There were 5542 messages that contained profanity, with the most common word being variants of "fuck." The word "game" was often in posts with profanity, which isn't too strange because as mentioned earlier, a lot of people were complaining about the game choice on this day. Other words that were popular in general, such as kappa and kreygasm, were also present in posts with profanity.

The streamer had a visible injury on this day, and there were a few words related to this injury that correlated highly with profanity. These would be messages like "what the hell happened to your arm?" The streamer's name was also quite prevalent in messages that contained profanity.

A little less common than that was a reference to "mods." It seems that people get upset with moderators for banning people and possibly being too harsh. Right below this is "subs," whom there seems to be quite a bit of hostility towards. I'm not sure if this is when subscriber only chat was used, but the use of profanity with "subs" is spread out throughout all of the messages during the day.

There are some profane words that come in bursts (presumably as a reaction to what is happening on the stream). Terms like "sex her" seem to come in bursts, which seems to show some of the more sexist aspects of the Twitch chat ("sex" was a word included as profanity even though it may not qualify as that in all cases).

Conclusions

The ubiquity of emotes on Twitch may be an interesting reason to conduct general NLP research through Twitch chat. Many of these emotes have sentiments or intentions tied to them, and for the most part, people use them for the "right" purpose. For example, Kappa is indicative of sarcasm or someone who is trolling. Sarcasm is notoriously hard for NLP to detect so having a hint like the Kappa emote could reveal general trends in sarcasm [2]. This would be a cool application of machine learning to NLP (maybe a future blog post?).

From a more practical point of view, information like this could be useful to streamers to figure out how they are doing. For example, if a streamer is trying some techniques to get chat more involved, it may be interesting to see if they are successful and they manage to increase the number of chatters with many posts. One thing I didn't consider is how top words change from day-to-day. The game being played and other factors such as recent events may cause these to fluctuate which could be interesting. Of course, more sophisticated analyses can be conducted than looking at top words, for example, looking at the grammar of the messages and seeing what the target of profanity is.

I also just considered one streamer's stream (because I couldn't find many chat logs), and I'm sure it would be interesting to see how other streams differ. The streamer I analyzed is clearly an extremely popular streamer, but it may be interesting to see if the distribution of the engagement level of chatters is different on smaller of streams. It would also be interesting to see if the things said toward female streamers are particularly different than those said to male streamers.

The code I used for this post is available here.

References
1. Yin, D. et al., 2009. Detection of Harassment on Web 2.0. Proceedings of the Content Analysis in the WEB 2.0 (CAW 2.0) Workshop at WWW2009.
2. Gonzalez-Ibanez, R., Muresan, S., and Wacholder, N., 2011. Identifying Sarcasm in Twitter: A Closer Look . Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2, 581-586.

# Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year, and generally appreciate it I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks they use, but I will not get into that here). Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and was often taking the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time when the train arrived on time to Durham, and often the delays were over an hour.

This brings me to the point of this post, which is to answer the question, when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available so that past information is all I have to go off of. Certainly if all the trains are actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would make you miss about this many trains.

Fig. 1: Delay of arrival of the 79 train to the Durham station for each day since 7/4/2006 (with cancelled trains omitted). Note that in the first year and a half of this data, there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all the data available for the 79 train arriving to the Durham station. I've plotted the arrival times during this time in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" where I sample the distribution of arrival times to the Durham station. The particular train I take is just another experiment, and I would expect it to follow the available distribution of arrivals. Thus, the probability of me missing the train if I arrive $\tau$ minutes after the scheduled time is

Where $N(t>t')$ counts the number of arrivals in the available data where the arrival $t'$ is greater than the specified $\tau.$ The question, then, is how much of the data to include in $N(t>t').$ To test this, I considered a half year's worth of data as a test set. Then, I figured out how much of the previous data I should use as my training set to most accurately capture the delays in the test set. I found that using a year of data prior to the prediction time worked the best. The method is not perfect; the percentage of missed trains predicted using the training set is about 10% off from the number in the test set, as there are year-to-year variations in the data.

A plot of $p(\tau)$ using all the available data and only using the last year of data is shown in Fig. 2. Using only the last year to build the model, to have an 80% chance of making the train, one can show up about 20 minutes after the scheduled time. This also confirms Amtrak's estimate that their trains are on time 20-30% of the time. Even if one shows up an hour after the scheduled time, though, he or she still has a 36% chance of making the train!

Fig. 2: $p(\tau)$ determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, $\delta,$ given the data of past arrivals, $\mathcal{D}.$ I will call this $p(\delta|\mathcal{D}).$ Suppose I have a model, characterized by a set of $n$ unknown parameters $\vec{a}$ that describes the probability of delay. I will assume all the important information that can be extracted from $\mathcal{D}$ is contained in $\vec{a}.$ Then, I can decompose the delay distribution as

Using Bayes theorem, $p(\vec{a}|\mathcal{D})$ can then be expressed as

Here, $p(\mathcal{D}|\vec{a})$ is the likelihood function (the model evaluated at all of the data points), $\pi(\vec{a})$ is the prior on the model parameters, and $p(\mathcal{D})$ is the evidence that serves as a normalization factor.I use non-informative priors for $\pi(\vec{a}).$

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model $\mathcal{M}_1$ is $f$ times greater than the evidence for a model $\mathcal{M}_2$ where

As the models are assumed to depend on parameters $\vec{a}$ (note that a method that does not explicitly have a functional form, such as a machine learning method, could still be used if $p(\mathcal{D}|\mathcal{M})$ could be estimated another way)

Here, $\delta_i$ are all of the delays contained in $\mathcal{D}.$ This integral becomes difficult for large $n$ (even $n=3$ is getting annoying). To make it more tractable, let $l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})),$ and let $\vec{a}^*$ be the value of the fit parameters that maximize $l(\vec{a}).$ Expanding as a Taylor series gives

where $H$ is the matrix of second derivatives of $l(\vec{a})$ evaluated at $\vec{a}^*.$ The integral can be evaluated using the Laplace approximation, giving

which can now be evaluated by finding a $\vec{a}^*$ that maximizes $p(\mathcal{D}|\vec{a}).$ (Regular priors much be chosen for $\pi(\vec{a}^*|\mathcal{M})$ since I have to evaluate the prior. I will ignore this point here). I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found under this procedure that the Gamma/Gompertz function described the data the best under this metric. Using this, I explicitly calculate $p(\delta|\mathcal{D}),$ again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.

While this section got a bit technical, it confirms the results of the earlier simple analysis. In particular, this predicts that one should show up about 17 minutes after the scheduled arrival time to ensure that he or she will catch 80% of trains, and that one still has 30% chance of catching the train if one shows up an hour late to the Durham station.

Fig. 3: $p(\delta|\mathcal{D})$ calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.

Conclusions

Since I downloaded all the data, 5 days have passed and in that time, the delay of the 79 train has been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it seems like the prediction that 80% of the time one would be fine showing up 20 minutes late to the Durham station isn't such a bad guideline.

Of course, both of these models described are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainties is that all days are not equivalent. Certain days will be more likely to have a delay than other. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the model, and for a true model of delays these should either be included or the effect of such systematic fluctuations could be characterized.

Scripts for this post are here.

# Predicting Fires

While organizing a data science workshop this summer, I realized that I hadn't ever written a blog post about the data science project I worked on last year. So, this post will be a summary of what I learned from working on the project.

The problem I worked on was Kaggle's Fire Peril Loss Cost. Given over 300 features, including weather and crime, for over one million insurance policies, we wanted to predict how much the insurance company would lose on a policy due to a fire. This is tricky as fires are rare events, and thus almost all policies have no loss.

Features
We first implemented one-hot encoding to turn the categorical variables into boolean arrays indicating whether a certain category applied to a policy or not. This is better than assigning a number in succession to each category (e.g. Category 1=0, Category 2=1, Category 3=2,...) as doing this will assign a hierarchy/metric on the data, which could create spurious relations. One this encoding was done, all of our data was in a numerical form that was amenable for use in machine learning algorithms.

With so many features on so many policies, the entire dataset would not fit in memory on my laptop. Also, we found that not all 300 features were good predictors of the target value. Thus, we spent time selecting the most important features to make predictions.

There were a manageable amount of categorical variables, so we kept all of these, though we did try removing some of them to see if there was any performance benefit from doing so. For each continuous variable, we tried a model where the prediction was simply the value of the variable. Since the evaluation metric (a weighted gini index) only depended on the ordering of the predictions and not the magnitudes, this analysis method was amenable for all of the continuous variables. Notably, we found that one of the variables (var13) was already a reasonably good predictor of the target. We kept the 30 continuous features that scored best under this metric. We chose this selection method over other more common feature selection methods (such as PCA) to avoid some of the stability issues associated with them, but it may have been interesting, given more time, to see how various feature selection methods fared with one another.

Machine Learning
Since only ~0.3% of policies had any loss, we considered using a classifier to first identify the policies with loss. Ideally, the rate of policies with loss identified by this classifier as not having loss would be sufficiently low. Then a regressor could be used on the policies identified by the classifier as having loss, and then the training set would be less singular. We tried a few classifiers, but did not have much success with this approach.

We then tried to use regressors directly. We tried many of the machine learning regressors available in scikit-learn. We found good results from ridge (Tikhonov) regression and gradient boosted decision trees. In the end, we ended up combining the predictions of the two methods, which will be discussed a bit later.

Ridge regression is similar to standard linear regression, but instead of just minimizing the 2-norm of the vector of residuals, there is a penalty term proportional to the two-norm of the vector of coefficients multiplying the features. This penalty for feature coefficients ensures that no feature coefficients become too large. Large feature coefficients could be a sign of a singular prediction matrix and thus could fluctuate wildly. We found through testing that the optimal constant in front of the 2-norm of the vector of feature coefficients was quite small, so the ridge regression was acting quite similarly to linear regression. Note that with ridge regression it is important that all the features are normalized, as not doing this affects the size of the 2-norm of the vector of feature coefficients.

A decision tree is a set of rules (decisions) used to group policies into different classes. These rules are simple ones such as "is var13>0.5?" These rules are chosen at each step so that they best split the set of items (the subsets should mostly all be of the same values). There are different notions for what best means here, but using the gini impurity or information gain (entropy) are common choices. With enough rules, given a grouping, each policy in the group will all have similar target values. A new policy with an unknown target can then be compared against these rules and the target can be predicted to take on the value of the target in the group of policies it ends up with. Note that the depth of the tree (the number of rules) needs to be limited so that the method does not overfit. One could come up with enough rules such that each policy is its own group, but then the method loses predictive power for a new policy with an unknown target value.

On its own, the decision tree is not great at regression. The power of the method comes from the "gradient boosting" part of the name. After a decision tree is created, there will invariably be some policies that are misclassified. In analogy to gradient descent, the decision tree is then then trained to fit the residuals as the new target variable (or more generally the negative gradient of the loss function). This corrects for errors made at each iteration, and after many iterations, makes for quite a robust regressor and classifier.

We got good performance from each of these methods, and the two methods arrive at the prediction in very different ways. Thus, we combine the results from these two machine learning methods to arrive at our final prediction. We considered a standard mean as well as a geometric mean for the final prediction. We found that the geometric combination was more useful. This seems reasonable in predicting rare events as then both methods have to agree that the prediction value is large to net a large prediction, whereas only one method has to have a small prediction value to net a small prediction.

Other Things
We probably could have dealt with missing entries better. It turned out that many of the features were strongly correlated with other features (some even perfectly) so we could have used this information to try to fill in the missing features. Instead, we filled all missing entries with a value of 0. In general, it's probably best to treat missing features as a systematic error and the effect could be quantified through cross-validating by considering various scenarios of filling in missing entries.

It also turned out that some of the features that were labeled as continuous were not actually continuous and were discrete (there were only a few values that the continuous variable took). There may have been some performance benefit from implementing one-hot encoding on these as well.

For the ridge regression, we could have applied standard model selection methods such as AIC and BIC to choose the key features. For the gradient-boosted decision trees, using these methods is a bit trickier as the complexity of the fit is not easy to determine.

One lesson I learned was the importance of cross-validation. $k$-fold cross-validation randomly splits up the training set into $k$ subsets. Then, each subset is used as a test set with the complement used as the training set. This gives an idea of how well the model is expected to perform and also how much the model may be expected to overfit. The cross-validation estimate of the error will be an overestimate of the true prediction error, since only a subset of the data is used for prediction, whereas the whole dataset would be used for a true prediction [1]. Ideally, one is in a regime where this difference is not crucial. By adjusting the value of $k,$ a good regime where this is true can be found.

While we did split up our data set into the halves and test by training on one half and predicting on the other, we could have been more careful about the process. In particular, the feature selection process could have been carefully verified. In addition, we trusted our position on the Kaggle leaderboards more than our cross-validation scores, which led our final predictions to overfit to the leaderboard more than we would have liked.

References:
1. Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning.

# Resistors and Distance on Graphs

I feel bad for not having written a post in so long! I have been busy with teaching, research, and various other projects. Now that my teaching duties are done, I will try to post more regularly!

A graph is a collection of nodes that are connected by edges that are drawn under a defined condition. These edges may or may not have weights. Graphs are useful representations of data in many scenarios. For example, the Internet is an example of a graph. Each web page would be a node and an edge would be drawn between two nodes if one of the pages links to the other. These edges could be unweighted or could represent the number of links between the two pages. Another example is social media. For example, the nodes could be all users of the service and an edge would be drawn if two nodes are "friends" with each other.

One way to define a distance between two nodes on the graph may be to find the shortest path between the nodes. This would be defined as the collection of edges from one node to the other such that the sum of the weights of the edges (or simply the number of edges in the case of an unweighted graph) is minimized. This is what LinkedIn seems to do when they compute the "degree" of connection of a stranger to you. This is also how Bacon and Erdös numbers are defined.

Fig. 1: Two graphs with the same shortest path distance between A(lice) and B(ob). However, it is clear that in the right graph A(lice) seems better connected to B(ob) than in the left graph.

A shortcoming of this measure, though, is that shortest path ignores how many paths there are from one node to the other. This scenario is depicted in Fig. 1. Suppose in a social network, we would like to determine something about how likely it is that one person (say Alice) will meet another (say Bob) in the future. Suppose Bob is a friend of only one of Alice's friends. Then, given no other information, it seems unlikely that Alice would ever meet Bob, since there is only one avenue for Alice to meet Bob. Of course, this could be different if Bob was good friends with Alice's significant other, so we might want to weight edges if we have information about how close Alice is to her friends in the social network. Now, if Bob is friends with half of Alice's friends, it seems quite likely that when Alice goes to a party with her friends or is hanging out with one of those friends, then Alice will run into Bob. In both of these cases, the shortest path distance between Alice and Bob is the same, but the outcome of how likely they are to meet (which is what we want to analyze) is different.

It turns out that a useful analogy can be made by considering each edge as a resistor in a resistor network. In the unweighted case, each edge is a resistor with resistance 1 (or $1~\Omega$ if having a resistance without units bothers you, though units will be dropped for the rest of the post) and in the weighted case, each edge is a resistor with resistance equal to the inverse weight of the edge. We will see that the effective resistance between two nodes is a good measure of distance that more accurately captures the scenario described above. Groups of nodes with small effective resistance between them will correspond to clusters of people (such as people who work in one workplace) in the social network.

The effective resistance satisfies the triangle inequality, which is a defining property of distances [1]. We can see this as follows. The effective resistance between nodes $a$ and $b$ is the voltage between the two nodes when one unit of current is sent into $a$ and extracted from $b.$ Let the voltage at $b$ be zero (we are always free to set zero potential wherever we like). Then $R_{ab} = v_a = (v_a-v_c)+v_c.$ Now, we know that $v_a-v_c \leq R_{ac},$ since the potential should be maximal at the source (current doesn't climb uphill). Similarly, $v_c\leq R_{cb}$ (remember node $b$ is grounded). Thus, we have that $R_{ab} \leq R_{ac}+R_{cb},$ and the triangle inequality holds. The other defining characteristics hold quite trivially, so the effective resistance is a reasonable way to calculate distance.

Now consider the example presented before and depicted in Fig. 1. If Bob is only friends with one of Alice's friends, and there are no other links between Alice and Bob, then the effective resistance between Alice and Bob is 2. In this case, the effective resistance is the same as the shortest path distance. If Bob is friends with 7 of Alice's friends, and there are no other links between Alice and Bob, the effective resistance between Alice and Bob is 0.29. So, we see that this satisfies the property that having one connection is not as important as having many connections to Bob.

Another interesting consequence of this analogy is related to random walks. Suppose a walker starts at a node $v$, and chooses a random edge connected to that node to walk along (the random choice is weighted by edge weights if the graph is weighted). Then, the expected number of steps to start at $v,$ go to another node $w,$ and return back to $v$ (which is called the commute time) is proportional to the effective resistance between $v$ and $w.$ One way to think about this is that a way to estimate effective resistance, then, on a large resistor network would be through the use of random walks [2]. This is interesting for tricky resistor problems such as the infinite grid of resistors. This also reinforces the idea that effective resistance is a good way of quantifying communities. When one node is well connected with another, it should be relatively easy to commute between the nodes, and thus they should be part of the same community. Further, this notion can be used to place bounds on the maximum possible commute time.

Effective resistance has many other uses as well. Quickly computing effective resistance turns out to be quite useful in speeding up an algorithm that solves the maximum flow problem [3]. In addition, sampling edges by weights proportional to the effective resistance across the edge yields an algorithm that sparsifies graphs. This means that the new graph has fewer nodes and edges than the original graph, but looks "close" to the original graph, and is a more efficient representation of that information [4].

References
1. Lau, L.C., 2015. Lecture notes on electrical networks.
2. Ellens, W., 2011. Effective Resistance.
3. Christiano, P., et al., 2010. Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs.
4. Spielman, D. and Srivastava, N., 2011. Graph Sparsification by Effective Resistances. SIAM J. Comput., 40(6), 1913–1926.

# Random numbers from a distribution and the Lambert W function

When I was making Physibounce, one of the things that took me a while to figure out is something that is unfortunately barely noticeable in gameplay. When the speed of light becomes small, I wanted the speed distribution of particles to be a 2D Maxwell-Juttner distribution rather than the non-relativistic Maxwell speed distribution. The probability density function for particles allowed to be close to the speed of light is:

Where $v$ is the velocity of the particle, $\gamma(v) = (1-\frac{v^2}{c^2})^{-1/2}$ is the Lorentz factor, $k_B$ is the Boltzmann constant, $m$ is the mass of the mass of the particle, $T$ is the temperature of the system, and $c$ is the speed of light. This distribution will look like a Maxwell distribution in the $c\to\infty$ limit, and is normalized so that $\int_0^c f(v) dv = 1.$

The simplest way to generate random numbers following a distribution is the acceptance-rejection method. This is what I do in Physibounce when the speed of light is not changed. However, I was not able to find an expression for a suitable (and efficient) bound (or maximum value) for the distribution given above. Because of this, I decided to use inverse sampling instead. To do this, I first find the cumulative distribution function as

Amazingly, Mathematica was able to do this analytically. By the definition, it is clear $F$ takes on values in the interval $[0,1].$ Thus, if I find a relation that expressed $v$ in terms of $F$ (equivalent to inverting the function), I have an expression that takes uniform values on the interval $[0,1]$ and maps them to a velocity in the interval $[0,c]$ that follows the desired distribution. Mathematica was also able to do the inversion, but the result was in terms of the Lambert W function.

The Lambert W function is the function that satisfies

So it seems reasonable something like this would come up when inverting the cumulative distribution function. An important thing about this function is that it is double valued! There are two branches. When I asked Mathematica to solve the problem, it chose a branch and the branch it chose turned out to be the wrong one. I wondered for a while why I was not getting reasonable velocity values when I plugged in random numbers, and after a bit I realized I needed to choose the other branch.

The next problem I ran into was that I did know of a way to actually evaluate the Lambert W function in actionscript. So I turned to the asymptotic expansions. It turned out I was in a regime where the argument of the Lambert W tended to be large. I took the first few terms in the expansion about $\infty$ for the Lambert W. I verified that over the parameter ranges I could expect in Physibounce that the difference between the Lambert W and the expansion was less than 1%. Since it's pretty hard to look at the speeds of a bunch of objects and determine the distribution of their speeds, I thought this was more than good enough, though I did include a check to make sure no particles were created with speed greater than $c.$ In the end the code ended up only being a few lines, but quite a bit of thought went into each of the lines!

See the ActionScript code here.