Warning: Spoilers ahead if you have not seen the first three episodes of the new season
Since I was very wrong last year about who would win Top Chef, I decided to take another stab at it.
This time, I approached it slightly differently. I predicted the win probability after each episode rather than each chef's ultimate placement. I did this since there are not that many relevant features to predict the placement. When I was validating the model on season 14, I found that the model was overfitting when using the location features (e.g., from California), so I decided to leave these features out as well.
The other big change was I decided to use an RNN model to process the sequence of episode performance for each chef. This way, I could train the model on all data, not just data from the first episode, but still, gain useful insights for the first episode.
The output of the RNN was then combined with features for gender, and whether the season had last chance kitchen to get the final prediction. I validated this with season 14. I tuned the training parameters to try to give Brooke the best chance of winning, though it was still challenging as she was eliminated.
Unfortunately, even with the last chance kitchen feature, the model is having a hard time accounting for chefs coming back. This isn't too surprising considering Kristen is the only one who won last chance kitchen and ended up winning the whole thing (until season 14). This is less relevant for predictions after the first two episodes, as a chef eliminated in the first two episodes is extremely unlikely to come back through last chance kitchen. The predictions after the first episode aren't too different from last year's model. Casey is the favorite, though Sheldon comes in as second instead of Katsuji.
After choosing optimal parameters using season 14, I use all of seasons 1-14 to train a model to predict on season 15. With this, after the first episode, I get the following win probabilities for each chef.
As expected, the winner of the first challenge (Tyler) is the favorite to win. I was delighted to see Tu come in second since I've been to his amazing popups and would love to see him win.
It's taken me a while to actually write this post, and since then, Claudette, Laura, and Rogelio have been eliminated, who were all in the 4% chance of winning crowd, so the model seems to be holding up so far!
I've written about some basic NLP on twitch chats before. This post is an extension of that, with more data, more sophisticated methods, and hopefully better results!
Word2vec is an algorithm that takes a one-hot encoding of words, which is a sparse, high dimensional representation, and maps it to a dense, low dimensional representation. In this low dimensional word embedding, the distance metric between words represents how likely words are to appear near each other in a sentence. This results in synonyms and related words being close to each other. In fact, misspelled words (which are common in twitch chats) are often the closest synonym for words. The low dimensional representation is learned by training a shallow neural network to predict a word given the words around it. This has some intriguing properties like being good at analogies (which will be discussed later).
So, I took a bunch of twitch chats I could find and trained a word2vec model on it. This was from chats of about 360 streamers over the past four years. Unfortunately, this isn't the most unbiased source of data. Clearly, larger streams have more chatters, and thus their chats will be overrepresented. In addition, the 360 streamers are a small fraction of all streamers on twitch. In fact, none of the streamers I regularly watch had available chat logs.
I did some cleaning and took out single word messages as well as anything twitchnotify and bots said ¹. Even if a chat is actually multiple sentences, I assume that each chat message is a sentence for the word2vec training. I also strip all symbols in messages to ignore any punctuation. This does have the downside of making words such as <3 and 3 equivalent. I chose a context size of 2, and require a word to appear more than 50 times in the corpus to train the word2vec model. I train a 200-dimensional word2vec model.
Once the word2vec is trained, the cosine distance between word vectors can be used to determine their similarity. This showed that the word closest to Kappa was, unsurprisingly, Keepo. This is followed by 4Head and DansGame; also twitch emotes. The closest non-emote word closest to Kappa was lol. This was unsatisfying for me because I feel like information is lost in translation from equating Kappa with lol, but it makes sense. Kappa is likely to appear at the end of sarcastic sentences, and it's quite reasonable for lol to occur in a similar context.
I then looked into analogies. Word2vec has a cool property that by adding and subtracting word vectors, a relationship between two words can be added to another word. The textbook example of this is man+queen-woman=king. That is, the relationship of the difference between queen and woman (royalty) is being added to man to get king. My word2vec model did, in fact, learn this relationship. With some of the game-related analogies, the model fares a bit worse. The expected analogy result was not necessary the closest vector to the vector sums but would be one of the closest.
The top three closest word vectors to the vector sum (or difference) of word vectors shown. This shows that the models may not learn relationships between words entirely, but is developing a pretty good idea.
Next, I plotted how words were distributed globally in the word embedding. I used PCA to reduce the 200-dimensional word embedding to 2 dimensions to visualize the relationships. What this showed was that foreign words cluster separately from English words. This makes sense, as it should be rare to combine German words with English ones in the same sentence. Another effect was the commonly used words clustered together, and then there is a region of context-specific words and meme words. Sub emotes are an example of context-specific words, as they are likely to be found only in the chat of one streamer, in which similar chat topics are present. A meme word would be something like the "word" AAAAE-A-A-I-A-U- that usually only appears with the Brain Power meme, and are unlikely to show up in any other context.
The word embedding of all the words in the corpus, with PCA used to reduce the dimensions from 200 to 2. Each dot represents a word. Natural clusters form in the word embedding.
Zooming into the common words area, the relationships between words become apparent. Most of the global emotes are toward the lower half of the common words range, while games sit on the top half. TriHard is closer to the left, getting close to the context-specific range, which makes sense as while TriHard is a global emote, it's probably used more in TriHex's chat. The politicians cluster together, with Obama closer to Clinton than Sanders or Trump.
Zooming into the common words part of the previous graph, and visualizing some of the words.
With the success of vector representation of words, a natural extension is a vector representation of chat messages. This can be a useful way to classify similar sentiments or intents between chatters. A simple way to get a vector representation of sentences is to take an average of the word vectors in the sentence in a bag of words approach, ignoring any words that do not have a vector representation. This is a commutative operation, so it is not a perfect representation of a sentence. For example, the sentences "the cat big the dog" and "the dog bit the cat" have different meanings. However, this is a good starting point to capture the overall intent of sentences.
I cluster the sentences to determine relationships between them. I sampled a set of 1,000,000 chat messages as this was about as much my computer could handle. Typically, I would use DBSCAN to cluster, but as this was computationally prohibitive, I opted for minibatch k-means. I chose the cluster size as 15, and 6 large clusters emerged from the clustering, shown below.
The sentence vectors of 1,000,000 chat messages, with PCA used to reduce the dimensions from 200 to 2. Each dot now represents a chat message, calculated as a sum of the words vectors in the chat message. The different colors represent clusters found by K-means.
Like the words, sentences containing foreign phrases cluster separately from the rest of sentences. Likewise, chats with sub emotes, and channel specific memes tend to cluster together. Spamming of global emotes was another cluster, and there is reasonably another cluster where global emotes are combined with text chat messages. Chat messages without emotes tend to cluster into two regions: one where the chatter is interacting with the streamer, and another where the chatter is talking about themselves or referencing others in chat (that is the personal pronouns category). These are general trends, and six clusters are not enough to capture all intents of chatters, but this gives a broad idea.
As mentioned earlier, in the context of twitch, this wasn't using that much available data. I'd expect that training on more data might help the model better learn some of the analogies. Another intriguing prospect is how the word embeddings change. I'm sure relationships between words like Clinton and Trump evolved over the course of last year's election. This raises interesting questions about what time period that word2vec should use as a training corpus.
Warning: Spoilers ahead if you have not seen the first two episodes of the new season
In the first episode of the season Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has won the first quickfire and gone on to be top chef (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9 when a whole team won the elimination challenge). This got me wondering if there were other predictors as to who would win Top Chef.
There's not too much data after the first elimination challenge, but I tried building a predictive model using the chef's gender, age, quickfire and elimination performance, and current residence (though I ultimately selected the most predictive features from the list). I used this data as features with a target variable of elimination number to build a gradient-boosted decision tree model to predict when the chefs this season would be eliminated. I validated the model with seasons 12 and 13 and then applied the model to season 14. I looked at the total distance between the predicted and actual placings of the contestants as the metric to optimize during validation. The model predicted both of these seasons correctly, but seasons 12 and 13 were two seasons where the winner of the first elimination challenge became top chef.
The most significant features in predicting the winner were: elimination challenge 1 performance, season (catching general trends across seasons), gender, home state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Chicago. To try to visualize this better, I used these significant features and performed a PCA to plot the data in two dimensions. This shows how data clusters, without any knowledge of the ultimate placement of the contestants.
A plot of the PCA components using the key identified features. The colors represent the ultimate position of the contestants. Blue represents more successful contestants where red represents less successful contestants. The direction corresponds mostly to first elimination success (with more successful contestants on the right), and the direction corresponds primarily to gender (with male on top). The smaller spreads indicate the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is an absolute deal-breaker to become top chef, but of course, winning the first challenge puts contestants in a better position.
My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.
Of course, there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about adding but didn't was what the chefs actually cooked. Specific ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g., season 11 was probably the only time turtle was cooked in an elimination challenge). With more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.