Word2vec is an algorithm that takes a one-hot encoding of words, which is a sparse, high dimensional representation, and maps it to a dense, low dimensional representation. In this low dimensional word embedding, the distance metric between words represents how likely words are to appear near each other in a sentence. This results in synonyms and related words being close to each other. In fact, misspelled words (which are common in twitch chats) are often the closest synonym for words. The low dimensional representation is learned by training a shallow neural network to predict a word given the words around it. This has some intriguing properties like being good at analogies (which will be discussed later).
So, I took a bunch of twitch chats I could find and trained a word2vec model on it. This was from chats of about 360 streamers over the past four years. Unfortunately, this isn't the most unbiased source of data. Clearly, larger streams have more chatters, and thus their chats will be overrepresented. In addition, the 360 streamers are a small fraction of all streamers on twitch. In fact, none of the streamers I regularly watch had available chat logs.
I did some cleaning and took out single word messages as well as anything twitchnotify and bots said ¹. Even if a chat is actually multiple sentences, I assume that each chat message is a sentence for the word2vec training. I also strip all symbols in messages to ignore any punctuation. This does have the downside of making words such as <3 and 3 equivalent. I chose a context size of 2, and require a word to appear more than 50 times in the corpus to train the word2vec model. I train a 200-dimensional word2vec model.
Once the word2vec is trained, the cosine distance between word vectors can be used to determine their similarity. This showed that the word closest to Kappa was, unsurprisingly, Keepo. This is followed by 4Head and DansGame; also twitch emotes. The closest non-emote word closest to Kappa was lol. This was unsatisfying for me because I feel like information is lost in translation from equating Kappa with lol, but it makes sense. Kappa is likely to appear at the end of sarcastic sentences, and it's quite reasonable for lol to occur in a similar context.
I then looked into analogies. Word2vec has a cool property that by adding and subtracting word vectors, a relationship between two words can be added to another word. The textbook example of this is man+queen-woman=king. That is, the relationship of the difference between queen and woman (royalty) is being added to man to get king. My word2vec model did, in fact, learn this relationship. With some of the game-related analogies, the model fares a bit worse. The expected analogy result was not necessary the closest vector to the vector sums but would be one of the closest.
The top three closest word vectors to the vector sum (or difference) of word vectors shown. This shows that the models may not learn relationships between words entirely, but is developing a pretty good idea.
Next, I plotted how words were distributed globally in the word embedding. I used PCA to reduce the 200-dimensional word embedding to 2 dimensions to visualize the relationships. What this showed was that foreign words cluster separately from English words. This makes sense, as it should be rare to combine German words with English ones in the same sentence. Another effect was the commonly used words clustered together, and then there is a region of context-specific words and meme words. Sub emotes are an example of context-specific words, as they are likely to be found only in the chat of one streamer, in which similar chat topics are present. A meme word would be something like the "word" AAAAE-A-A-I-A-U- that usually only appears with the Brain Power meme, and are unlikely to show up in any other context.
The word embedding of all the words in the corpus, with PCA used to reduce the dimensions from 200 to 2. Each dot represents a word. Natural clusters form in the word embedding.
Zooming into the common words area, the relationships between words become apparent. Most of the global emotes are toward the lower half of the common words range, while games sit on the top half. TriHard is closer to the left, getting close to the context-specific range, which makes sense as while TriHard is a global emote, it's probably used more in TriHex's chat. The politicians cluster together, with Obama closer to Clinton than Sanders or Trump.
Zooming into the common words part of the previous graph, and visualizing some of the words.
With the success of vector representation of words, a natural extension is a vector representation of chat messages. This can be a useful way to classify similar sentiments or intents between chatters. A simple way to get a vector representation of sentences is to take an average of the word vectors in the sentence in a bag of words approach, ignoring any words that do not have a vector representation. This is a commutative operation, so it is not a perfect representation of a sentence. For example, the sentences "the cat big the dog" and "the dog bit the cat" have different meanings. However, this is a good starting point to capture the overall intent of sentences.
I cluster the sentences to determine relationships between them. I sampled a set of 1,000,000 chat messages as this was about as much my computer could handle. Typically, I would use DBSCAN to cluster, but as this was computationally prohibitive, I opted for minibatch k-means. I chose the cluster size as 15, and 6 large clusters emerged from the clustering, shown below.
The sentence vectors of 1,000,000 chat messages, with PCA used to reduce the dimensions from 200 to 2. Each dot now represents a chat message, calculated as a sum of the words vectors in the chat message. The different colors represent clusters found by K-means.
Like the words, sentences containing foreign phrases cluster separately from the rest of sentences. Likewise, chats with sub emotes, and channel specific memes tend to cluster together. Spamming of global emotes was another cluster, and there is reasonably another cluster where global emotes are combined with text chat messages. Chat messages without emotes tend to cluster into two regions: one where the chatter is interacting with the streamer, and another where the chatter is talking about themselves or referencing others in chat (that is the personal pronouns category). These are general trends, and six clusters are not enough to capture all intents of chatters, but this gives a broad idea.
As mentioned earlier, in the context of twitch, this wasn't using that much available data. I'd expect that training on more data might help the model better learn some of the analogies. Another intriguing prospect is how the word embeddings change. I'm sure relationships between words like Clinton and Trump evolved over the course of last year's election. This raises interesting questions about what time period that word2vec should use as a training corpus.
Code for this post is available here.