deep learning

Optimization Functions in Deep Learning

This will be a short post about optimization functions in deep learning. For simplicity, I'll show how these optimization functions work with a simple 1D example. Suppose that the loss function, L, as a function of weight, w, looks like the following.

The flat region for values of the weight between 1 and 2 is a little contrived, but this will help depict some of the advantages of the more complicated optimization functions. Saddle points are ubiquitous in a deep learning training landscape, so these are valid concerns to consider.

With the standard gradient descent described in a previous post, the optimization needs to entirely avoid the flat region. In this region, the gradient is zero, and thus no update to the weights can be made. As shown below, for positive starting values of the weight, it is hard to avoid the zero gradient region. Even a negative starting value can end up in the flat region if the learning rate is large enough.


Loss as a function of optimization steps starting with a weight of -10 (top) and a weight of 10 (bottom) for different learning rates. Note that the loss flattening out at 10^0 is an artifact of the flat region of the loss function.

Decaying the learning rate as the optimization time increases also doesn't really help here. The biggest problem is the algorithm will get stuck in the flat region, and changing the learning rate won't fix this issue.


This example shows that the crux of the problem is that the update goes straight to zero when a flat region is encountered. The idea behind momentum is to track a momentum variable that updates with each step.

 v_i \to \alpha v_i - \eta \frac{\partial L}{\partial w_i}

 w_i \to w_i+v_i

In these expressions, a subscript denotes the vector component to allow the expressions to generalize to multiple dimensions. There is now a new parameter, \alpha, that tracks how much memory of previous gradients the optimization process should have. Note that when \alpha = 0, this reduces to standard gradient descent.

Implementing this on the function above, we see now that as long as the momentum parameter is large enough, the flat region can be avoided, even with the same learning rate as before. On top of that, the convergence is far faster than any of the choices above.


Loss as a function of optimization steps starting with a weight of 10 for different momentum parameters.


Typical deep learning models have many orders of magnitude more parameters (100000+). Suppose that during the training process, the gradients start out large for some of these variables but not others. There will be a significant update for those variables with large gradients. The gradients of the other parameters might become larger as the training proceeds. However, if the learning rate is decaying through the training process, these updates will not change the parameters as much as if these large gradients had appeared earlier in the training process.

Thus, RMSprop attempts to tune the learning rate for each parameter. If the gradient over the previous steps has been large for that parameter, the learning rate for that parameter will be smaller. This allows for smaller steps to refine the value of the parameter, as discussed in a previous post. Conversely, if the gradient over the last few steps has been small, we may want a larger learning rate to be able to move more in that parameter. We don't want the memory of the gradient sizes to be too long since it seems perverse for the first gradients to affect the 100000th step of training. Thus, a moving average is used to keep track of the gradient magnitudes. The updates of RMSprop are the following.

 r_i \to \rho r_i + (1-\rho)\frac{\partial L}{\partial w_i}\frac{\partial L}{\partial w_i}

 w_i \to w_i-\frac{\eta}{\sqrt{r_i+\delta}}\frac{\partial L}{\partial w_i}

We see that \sqrt{r_i} is a moving average of the gradient magnitude for each parameter. The moving average is characterized by the parameter \rho. \delta is a parameter chosen such that denominator in the fraction does not diverge. \eta, as before is the learning rate of the problem. Global learning rate decay can also be combined with this method to ensure learning rates for each parameter are bounded.

Since the update is always directly proportional to the gradient, for the problem at hand, RMSprop doesn't help and will get stuck much like traditional gradient descent.


If momentum works well, and RMSprop works well, then why not combine both? This is the idea behind Adam. The gradient descent updates of Adam are the following.

v_i \to \alpha v_i + (1-\alpha)\frac{\partial L}{\partial w_i}

 r_i \to \rho r_i + (1-\rho)\frac{\partial L}{\partial w_i}\frac{\partial L}{\partial w_i}

 w_i \to w_i-\frac{\eta}{\sqrt{r_i+\delta}}v_i

The two moving averages, corresponding to the momentum and gradient magnitude, are tracked and used for the update. Adam is my go-to optimizer when training deep learning models. Typically the only parameter that needs tuning is the learning rate, \eta, and the default values of the other parameters (\delta=10^{-8},\alpha=0.9,\rho=0.999) tend to work really well. I applied Adam to the 1D example loss function with these default parameters and got some decent results. This looks a little worse than momentum, but this is because this problem is so simple. For most real-world problems Adam outperforms momentum.

Loss as a function of optimization steps using Adam optimizer with default parameters starting with a weight of 10 for different learning rates.

Who will win Top Chef Season 15?

Warning: Spoilers ahead if you have not seen the first three episodes of the new season

Since I was very wrong last year about who would win Top Chef, I decided to take another stab at it.

This time, I approached it slightly differently. I predicted the win probability after each episode rather than each chef's ultimate placement. I did this since there are not that many relevant features to predict the placement. When I was validating the model on season 14, I found that the model was overfitting when using the location features (e.g., from California), so I decided to leave these features out as well.

The other big change was I decided to use an RNN model to process the sequence of episode performance for each chef. This way, I could train the model on all data, not just data from the first episode, but still, gain useful insights for the first episode.

The output of the RNN was then combined with features for gender, and whether the season had last chance kitchen to get the final prediction. I validated this with season 14. I tuned the training parameters to try to give Brooke the best chance of winning, though it was still challenging as she was eliminated.

Unfortunately, even with the last chance kitchen feature, the model is having a hard time accounting for chefs coming back. This isn't too surprising considering Kristen is the only one who won last chance kitchen and ended up winning the whole thing (until season 14). This is less relevant for predictions after the first two episodes, as a chef eliminated in the first two episodes is extremely unlikely to come back through last chance kitchen. The predictions after the first episode aren't too different from last year's model. Casey is the favorite, though Sheldon comes in as second instead of Katsuji.

After choosing optimal parameters using season 14, I use all of seasons 1-14 to train a model to predict on season 15. With this, after the first episode, I get the following win probabilities for each chef.

Screen Shot 2017-12-25 at 8.35.53 AM
As expected, the winner of the first challenge (Tyler) is the favorite to win. I was delighted to see Tu come in second since I've been to his amazing popups and would love to see him win.

It's taken me a while to actually write this post, and since then, Claudette, Laura, and Rogelio have been eliminated, who were all in the 4% chance of winning crowd, so the model seems to be holding up so far!

What does Kappa mean?

I've written about some basic NLP on twitch chats before. This post is an extension of that, with more data, more sophisticated methods, and hopefully better results!

Word2vec is an algorithm that takes a one-hot encoding of words, which is a sparse, high dimensional representation, and maps it to a dense, low dimensional representation. In this low dimensional word embedding, the distance metric between words represents how likely words are to appear near each other in a sentence. This results in synonyms and related words being close to each other. In fact, misspelled words (which are common in twitch chats) are often the closest synonym for words. The low dimensional representation is learned by training a shallow neural network to predict a word given the words around it. This has some intriguing properties like being good at analogies (which will be discussed later).

So, I took a bunch of twitch chats I could find and trained a word2vec model on it. This was from chats of about 360 streamers over the past four years. Unfortunately, this isn't the most unbiased source of data. Clearly, larger streams have more chatters, and thus their chats will be overrepresented. In addition, the 360 streamers are a small fraction of all streamers on twitch. In fact, none of the streamers I regularly watch had available chat logs.

I did some cleaning and took out single word messages as well as anything twitchnotify and bots said ¹. Even if a chat is actually multiple sentences, I assume that each chat message is a sentence for the word2vec training. I also strip all symbols in messages to ignore any punctuation. This does have the downside of making words such as <3 and 3 equivalent. I chose a context size of 2, and require a word to appear more than 50 times in the corpus to train the word2vec model. I train a 200-dimensional word2vec model.

Once the word2vec is trained, the cosine distance between word vectors can be used to determine their similarity. This showed that the word closest to Kappa was, unsurprisingly, Keepo. This is followed by 4Head and DansGame; also twitch emotes. The closest non-emote word closest to Kappa was lol. This was unsatisfying for me because I feel like information is lost in translation from equating Kappa with lol, but it makes sense. Kappa is likely to appear at the end of sarcastic sentences, and it's quite reasonable for lol to occur in a similar context.

I then looked into analogies. Word2vec has a cool property that by adding and subtracting word vectors, a relationship between two words can be added to another word. The textbook example of this is man+queen-woman=king. That is, the relationship of the difference between queen and woman (royalty) is being added to man to get king. My word2vec model did, in fact, learn this relationship. With some of the game-related analogies, the model fares a bit worse. The expected analogy result was not necessary the closest vector to the vector sums but would be one of the closest.

Screen Shot 2017-12-08 at 9.58.04 PM

The top three closest word vectors to the vector sum (or difference) of word vectors shown. This shows that the models may not learn relationships between words entirely, but is developing a pretty good idea.  

Next, I plotted how words were distributed globally in the word embedding. I used PCA to reduce the 200-dimensional word embedding to 2 dimensions to visualize the relationships. What this showed was that foreign words cluster separately from English words. This makes sense, as it should be rare to combine German words with English ones in the same sentence. Another effect was the commonly used words clustered together, and then there is a region of context-specific words and meme words. Sub emotes are an example of context-specific words, as they are likely to be found only in the chat of one streamer, in which similar chat topics are present. A meme word would be something like the "word" AAAAE-A-A-I-A-U- that usually only appears with the Brain Power meme, and are unlikely to show up in any other context.

Screen Shot 2017-11-16 at 5.49.38 PM

The word embedding of all the words in the corpus, with PCA used to reduce the dimensions from 200 to 2. Each dot represents a word. Natural clusters form in the word embedding.

Zooming into the common words area, the relationships between words become apparent. Most of the global emotes are toward the lower half of the common words range, while games sit on the top half. TriHard is closer to the left, getting close to the context-specific range, which makes sense as while TriHard is a global emote, it's probably used more in TriHex's chat. The politicians cluster together, with Obama closer to Clinton than Sanders or Trump.

Screen Shot 2017-11-18 at 10.04.06 PM

Zooming into the common words part of the previous graph, and visualizing some of the words.

With the success of vector representation of words, a natural extension is a vector representation of chat messages. This can be a useful way to classify similar sentiments or intents between chatters. A simple way to get a vector representation of sentences is to take an average of the word vectors in the sentence in a bag of words approach, ignoring any words that do not have a vector representation. This is a commutative operation, so it is not a perfect representation of a sentence. For example, the sentences "the cat big the dog" and "the dog bit the cat" have different meanings.  However, this is a good starting point to capture the overall intent of sentences.

I cluster the sentences to determine relationships between them. I sampled a set of 1,000,000 chat messages as this was about as much my computer could handle. Typically, I would use DBSCAN to cluster, but as this was computationally prohibitive, I opted for minibatch k-means. I chose the cluster size as 15, and 6 large clusters emerged from the clustering, shown below.

Screen Shot 2017-12-08 at 8.57.40 AM

The sentence vectors of 1,000,000 chat messages, with PCA used to reduce the dimensions from 200 to 2. Each dot now represents a chat message, calculated as a sum of the words vectors in the chat message. The different colors represent clusters found by K-means.

Like the words, sentences containing foreign phrases cluster separately from the rest of sentences. Likewise, chats with sub emotes, and channel specific memes tend to cluster together. Spamming of global emotes was another cluster, and there is reasonably another cluster where global emotes are combined with text chat messages. Chat messages without emotes tend to cluster into two regions: one where the chatter is interacting with the streamer, and another where the chatter is talking about themselves or referencing others in chat (that is the personal pronouns category). These are general trends, and six clusters are not enough to capture all intents of chatters, but this gives a broad idea.

As mentioned earlier, in the context of twitch, this wasn't using that much available data. I'd expect that training on more data might help the model better learn some of the analogies. Another intriguing prospect is how the word embeddings change. I'm sure relationships between words like Clinton and Trump evolved over the course of last year's election. This raises interesting questions about what time period that word2vec should use as a training corpus.

Code for this post is available here.

1. I define bots as tags containing the string 'bot.'


I'm trying to collect some notes on machine learning topics, and so I'm creating a new series of tutorials on this blog. The first subject is autoencoders. Basic autoencoders are simple neural networks, making them a great platform to discuss basics of neural networks. Despite the simplicity, autoencoders have utility in their own right, making them more exciting than a toy example.

Dimensionality Reduction
Autoencoders are a type of dimensionality reduction technique. Dimensionality reduction is used to project vectors in a large space to a smaller space, ideally losing little information in the smaller space. This can be used, for example, for compression, feature extraction, and visualization. In fact, I've used PCA to visualize high dimensional vectors before in this blog.

Depending on the desired properties of the transformation, there are multiple ways to reduce vector dimensions. Common ways I typically use are Johnson-Lindenstrauss, PCA, and t-SNE. My go-to dimensionality reduction algorithm is PCA, so I'll review the general idea behind PCA to offer a comparison point with autoencoders.

Assume that I have a matrix, Z, with dimensions the output vector size by the input vector size. Z will be the decoding matrix. While a decoder is not necessary to encode vectors to the reduced dimension, it will help characterize how much information is lost in the dimensional reduction process. To simplify the problem, I assume that the columns of Z are orthogonal, that is, Z^TZ = I (but ZZ^T\neq I). Let W be the encoding matrix with dimensions input vector size by output vector size. No assumptions will be made on W at this point.

Consider a set of \mathcal{N} input vectors, x^{(n)} for n\in\{1,...\mathcal{N}\}, that will be projected to a lower dimensional space. The vector, \chi^{(n)} in the lower dimensional space is

 \chi^{(n)} = W x^{(n)}.

A faithful representation of the collection \{x^{(n)}\} in the lower dimension will minimize the reconstruction accuracy,

 L = \frac{1}{\mathcal{N}}\sum_{n=1}^\mathcal{N} \| x^{(n)}- ZWx^{(n)}\|_2^2.

The smaller this reconstruction is, the less information that is lost by reducing the dimensions of the input vectors. Let X be the collection of all the \{x^{(n)}\}. Then, the loss can be rewritten as

 L=\frac{1}{\mathcal{N}}\|X-ZWX\|_F^2 = \frac{1}{\mathcal{N}}\mathrm{tr}((X-ZWX)^T(X-ZWX)).

Expanding gives


I used the fact Z^TZ=I to simplify the last term. To minimize the reconstruction error, take the partial derivative with respect to W to get

 \frac{\partial L}{\partial W} = \frac{1}{\mathcal{N}}\left(-2XX^TZ +2XX^TW^T\right)

Setting to zero, I find the optimal solution is W=Z^T. Hence, the optimal solution has the encoder as a transpose of the decoder. Plugging this in, the loss is

 L = \frac{1}{\mathcal{N}}\left(\mathrm{tr}(XX^T)-\mathrm{tr}(ZZ^TXX^T)\right).

Thus, the loss is related to how well the projection matrix ZZ^T maps XX^T to itself. This is of course, non-trivial as ZZ^T is not full rank. Thus, L is minimized when ZZ^T is the best low-rank approximation of XX^T. This is a property of the SVD. Hence, the SVD of XX^T, keeping only the output dimension size of singular values is the optimal dimensionality reduction method, given the assumptions.

Neural Networks
An autoencoder is one way that neural networks can be used as a dimensionality reduction technique. The idea of an autoencoder is to have a neural network learn the identity function. That is, the loss function being optimized is the L^2 distance between the network prediction and the input vector. Note that this is the same loss function as used for PCA. Again, in this case, learning the identity is not trivial as the autoencoder has a hidden layer with much fewer neurons that the input layer. This forces the model to learn to efficiently represent the input vectors with fewer parameters. To actually get an autoencoder, the output layer is stripped so that the learned low dimensional representation can be utilized.

To make this a bit more concrete, consider the problem of reducing 200-dimensional vectors to 2 dimensions. The network to train the autoencoder would look like the image below.

Screen Shot 2017-12-06 at 9.22.05 PMThe architecture of the autoencoder neural network. Each circle on the image represents a neuron. Each of the 200 elements of the input vector x is a neuron on the left side of the image. Each of the two neurons in the center of the image represents a vector of weights. The neuron outputs the dot product of those weights with the input vector and adds a bias. W^{(1)} is the collection of weights from both of these neurons. Typically, the output layer is the desired output of a neural network, but for an autoencoder, the desired output is the hidden layer outputs. The 200 neurons on the right are the output neurons. These also represent a dot product and bias-add operation, with the weights stored in the matrix W^{(2)}. The output neurons results are used to train the network. The output of each layer is represented in o^{(l)}.

Each (non-input) neuron represents a dot product of the weights of that neuron and its inputs, plus a bias. For example, for the hidden layer, this is w\cdot x+b, with w\in\mathbb{R}^{200}, the vector of neuron weights, x\in\mathbb{R}^{200}, the input vector, and b\in\mathbb{R}, the neuron bias. Typically, an activation function¹ is applied to the output of the neuron, but here I will assume a linear activation function for simplicity. The network above has 402 tunable parameters in its hidden layer and 600 parameters in its output layer for a total of 1002 parameters. The goal of training is to choose these parameters such that the network is an effective autoencoder.

Again, consider a set of \mathcal{N} input vectors, x^{(n)} for x\in\{1,...\mathcal{N}\}, that will be projected to a lower dimensional space. I will use notation where subscripts denote indices to the elements of the matrix or vector. Superscripts denote the layer number for weights and outputs (W and o), while they index the training data for data vectors (x and y). Explicitly, in terms of each input vector x^{(n)} and each output vector o^{(2)}, the loss function is

 L = \frac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\|x^{(n)}-o^{(2)}\|_2^2 = \frac{1}{\mathcal{N}}\sum_{i=1}^{\mathcal{N}}\|x^{(n)}- W^{(2)}(W^{(1)}x^{(n)}+b^{(1)})-b^{(2)}\|_2^2.

The factor of 1/\mathcal{N} in front makes losses comparable for differently sized training sets. As with PCA, I wish to minimize the loss with respect to the weights.

Once trained, the autoencoded low dimensional representation \chi for a vector x is

 \chi = W^{(1)}x+b^{(1)}.

Note the similarity of this expression to PCA. An autoencoder can be seen as a generalization of PCA in which the restrictions on orthonormal columns of the decoding matrix are lifted.

Unlike PCA, simple calculus cannot be used to obtain the weights that optimize the loss. Thus, I use gradient descent. The gradient of the loss with respect to the weights (W^{(1)}, W^{(2)}, b^{(1)}, and b^{(2)}) gives, locally, the direction of greatest descent in the loss function. Thus, by updating the weights along the gradient, the loss function will move closer to its optimum.

 W^{(i)} \to W^{(i)} - \eta \frac{\partial L}{\partial W^{(i)}} \;\;\;\;\;\;\;\; b^{(i)} \to b^{(i)} - \eta \frac{\partial L}{\partial b^{(i)}}

Here, \eta is the learning rate. If the learning rate is too small, the progress toward the minimum is slow, and thus training takes too long. However, if the learning rate is too large, the steps taken can escape a regime where a linear approximation to the loss may be inappropriate. This can result in the training not converging. Tuning an optimal learning rate is thus important to ensure effective training. In training, the value of \eta is typically decayed so that the training takes large steps toward the optimum initially, but has a small learning rate when the training is refining the weights near the optimum.

Screen Shot 2017-12-06 at 8.52.44 PMTypical behavior of loss as a function of training steps for different values of the learning rate (assuming no learning rate decay). If the learning rate is too high, the loss can diverge. On the other hand, if the learning rate is too low, training will take a long time.

For a simple case of an autoencoder, it is not too hard to explicitly calculate the update for each of the 1002 weights. However, this introduces inefficiencies in the training as the number of layers increases. Consider an entry of the weight matrix W^{(l)}_{ij}. The entry W^{(l)}_{ij} only multiples the jth component of the output of the previous layer, o^{(l-1)}_j, and no other outputs of the previous layer. All dependencies at later layers are entirely contained in the outputs of this layer, o_i^{(l)}. Thus,

 \frac{\partial L}{\partial W^{(l)}_{ij}} = \frac{\partial L}{\partial o_i^{(l)}}\frac{\partial o_i^{(l)}}{\partial W^{(l)}_{ij}} = \frac{\partial L}{\partial o_i^{(l)}} o^{(l-1)}_j

The layer outputs, o^{(l-1)}_j can be cached during the forward pass when the loss functions is calculated. Using the chain rule, I can simplify \partial L/\partial o_i^{(l)} further to get

 \frac{\partial L}{\partial o_i^{(l)}} = \sum_{m=1}^{|l+1|}\frac{\partial L}{\partial o_m^{(l+1)}} \frac{\partial o_m^{(l+1)}}{\partial o_i^{(l)}}= \sum_{m=1}^{|l+1|}\frac{\partial L}{\partial o_m^{(l+1)}} W_{mi}^{(l+1)}.

Here, |l| denotes the number of neurons in the lth layer. This means that the update rule for layer l only depends on pre-computed outputs and weight updates from later layers. By starting at the output layer and backpropagating, all the information is available to compute updates for weights at each layer, as long as the results from the forward and backward propagation are cached. This allows for efficient calculation of weight updates.

The loss function is an average of the RMS loss over all input vectors. By the central limit theorem, as long as input vectors are chosen randomly, a good approximation to the average can be computed without using all the input vectors. Significant time can be saved by updating with these batch gradients. Denote the batch size as \beta. A training epoch consists of calculating \mathcal{N}/\beta gradients on a random ordering of the training set. Conditioned on \mathcal{N}\gg\beta, choosing the batch size as large as the machine memory will allow yields the fastest convergence. In fact, many distributed neural network training systems achieve speedups by increasing the batch size. When memory permits for a \beta that is a sizeable fraction of \mathcal{N} (and getting more data is not an option), the batch size should be tuned to optimize the training.

The autoencoder described here is the most trivial autoencoder. Autoencoders with more hidden layers and other network components are used, but these are beyond the scope of this post.

1. Commonly, relu is used. Unless the network performance needs to be highly optimized, this is usually sufficient to obtain good performance with quick training. In theory, any nonlinear function should be able to achieve good performance.

Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really I read almost every article written), and I think they do a great job of interpreting poll results. However, I do notice that the process in which they compile the results is tedious. Some of the tediousness, like collecting poll results, is inevitable. However, other work including figuring out which polls are reliable and how to interpret polls across different states seems like it might be automatable. I took a stab at this automation using data that fivethirtyeight compiled, which has gathered polling data for various state and national elections three weeks before the election date. This was about 6200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but what kind of moving average taken is up for debate.
2) How to decide which polls are good. Dishonest polls are a problem. If a pollster is partisan, there needs to be a way to take into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll if it disagrees with the conventional wisdom, this can introduce bias. There is also uncertainty about how undecided voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue was tricky, and will not be covered here, but the first three issues are discussed in this post.

I tackle the problem as a time series prediction problem. That is, given a time series (when the polls happen and their results) I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to solve the problem. RNNs even handle different-length sequences quite naturally, which is a plus as this is awkward to encode into useful input for other types of models like a tree-based model.

I use the data before 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, instead of whole elections (or the entire presidential race, which is equally complex). For each poll, I compile a set of information about the polls:
• How each candidate in the race polled, in order of democrat, republican, and highest polling third party candidate, if available. The ordering is relevant as I account for the political preferences of polls.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• Of polls conducted before the poll, the percentile of the number of polls the pollster has conducted relative to all other pollsters. The intuition here is that agencies that do one poll are not very reliable, whereas agencies that have done many polls probably are.

All of this information is collected into a single vector, which I will call the poll information. I then take all polls of the same race previous to that poll and make an ordered list of the poll informations, which is the sequence that is the input to the neural network model.

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any sense of "year" as input to the neural network as I wish the model to extrapolate on this variable and hence I do not want the model to overfit to any trends there may be in this variable. I use three LSTM layers with 256 units followed by two fully connected layers with 512 neurons. The output layer is one neuron that predicts the final margin of the election. Mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initialize by transfer learning from an exponentially weighted moving average.

I use dropout at time of prediction as a way to get an estimate of the error in the output of the model. The range where 90% of predictions lie using different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on a training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 election). I find the percentile of the ground truth election result within the Monte Carlo model predictions. Thus, a perfectly calibrated model would have a uniform distribution of the percentile of ground truth election results within the Monte Carlo model predictions. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic with the uniform distribution. This turned out to be 40%, which was comforting as this is a typical choice for dropout at training.


A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data before 2016 (thus using both the training and validation set). I take the calibrated dropout rate and again get Monte Carlo samples for polls, using the newly trained model, on the test set. I then count the number of positive-margin Monte Carlo outputs to obtain a probability that the election swings in favor of the first party.

Applying this to a Senate race, I find that the uncertainty in election margin is sizable, around 15%. This is comparable to uncertainties obtained through other aggregation methods, but this shows that it is tough to accurately call close races just based on polls.


Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available that day, for each state I run 1000 predictions with different RNG seeds. For each of these 1000 predictions, I add up the electoral votes the candidate would win if they had the predicted margins. The probability of the candidate winning is then the percentage of these outcomes that is below 270.

Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republic victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.


The probability of Clinton winning the 2016 election predicted by the model as a function of days before the election.

Unless the 2016 election was a rare event, this shows that clearly, the model is incomplete. Relationships between how states vote compared to polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively, and in particular, better discern which pollsters are reliable. More features, such as whether the incumbent president is running or if it is an off-year election may also add more information in the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.


1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or more than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the primary contributor to the KS test statistic, so it is ignored here.


Predicting Elections from Pictures

This work was done by the fantastic team I mentored during the CDIPS data science workshop.

I got the idea for this project after reading Subliminal by Leonard Mlodinow. That book cited research suggesting that when people are asked to rate pictures of people based on competency, the average competency score of a candidate is predictive of whether the candidate will win or not. The predictions are correct about 70% of the time for senators and 60% for house members, so while not a reliable indicator, there seems to be some correlation between appearance and winning Senate races. So, we decided to use machine learning to build a model to assess senator faces in the same manner.

This is a challenge as there are only around 30 Senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old since as trends in hair and fashion change, these probably affect how people perceive competence of people. We ended up using elections from 2007-2014 as training data to predict on the 2016 election. We got the senator images from Wikipedia and Google image search. We used images for the top two finishers, which is usually a democrat and republican. We didn't include other elections since the images were less readily available and we aren't sure if appearing senatorial is the same as appearing presidential (more on that later).

Interpreting Faces
We use a neural network to learn the relationships between pictures and likelihood of winning elections. As input, we provide a senator image with the label of whether the candidate won their race or not. Note that this means that our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the election (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To assess the winner of a particular election, we compare the probability of the two candidates and assume the candidate with the higher probability will win.

To cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers followed by max-pooling layers. After the convolutional layers, we used a fully connected layer before outputting the election win probabilities. Even with these simple networks, there are millions of parameters that must be constrained, which will result in overfitting with the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images to increase the number of training images to make the training more robust. We also explored transfer learning, where we train the model using a similar problem with more data, and use that as a base network to train the senator model on.

We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).

Model Results
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on senator images (with the image modifications).
(II) Transfer on senator images from male/female classifier trained on faces in the wild.
(III) Transfer on face images from vgg face (this is a much deeper network than the first two).
We compare and contrast each of these approaches to the problem.

The accuracy in predicting the winners in each state in 2016 for each model were respectively (I) 82%, (II) 74%, and (III) 85%. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of Senate races in 2016 is small, so these results come with a lot of uncertainty. Further, the training and validation sets are not independent. Many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.

Screen Shot 2017-08-19 at 4.42.13 PM

The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.

Validating Results
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.

Screen Shot 2017-08-19 at 1.22.32 PMScreen Shot 2017-08-19 at 4.53.34 PM

Scores predicted by each of the models for different pictures of the same candidate. Each row is the prediction of each of the three models.

All of our models have some variability in predictions on pictures of the same candidate so our model may benefit from learning on more varied pictures of candidates. We have to be careful, though, as lesser-known candidates will have fewer pictures and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest score among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos and the model may be catching this trait.

We also looked at what features the model learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the trustworthiness of a candidate.

Screen Shot 2017-08-19 at 1.24.10 PM

Scores predicted by each of the models for pictures where part of the image is masked. Each row is the prediction of each of the three models.

We find masking the images wildly changes the prediction for models (I) and (II) but not for model (III). It seems that for this model apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.

We also compared what the first convolutional layer in each of the models learned.

Screen Shot 2017-08-19 at 9.22.41 PM

Samples of the first convolutional layer activations after passing in the image on the far left as visualized by the quiver package. Each row shows the output for each of the three models. We see that each model picks up on similar features in the image.

This confirms that apparel is quite significant, with the candidate's tie and suit being picked up by each of the models. The models do also pick up on some of the edges in facial features as well. A close inspection of the layer output shows that the result of model (III) is cleaner than the other two models in picking up these features.

Given all of these findings, we determine the most robust model is model (III), which was the model that predicted the most 2016 elections correctly as well.

Earlier, we mentioned we trained on senator data because we were not sure whether other elections had similar relationships between face and winning. We tested this hypothesis on the last three presidential elections. This is a limited data set, but we find the model predicts only one of the elections correctly. Since presidential elections are so rare, training a model to predict on presidents is a challenge.

Screen Shot 2017-08-19 at 10.41.20 PM

Model (III) predictions on presidential candidates.

Our models were trained to give a general probability of winning an election. We ignore the fact that senator elections, for the most part, are head to head. There may be benefits from training models to consider the two candidates running for the election and having the model choose the winner. Ultimately, we would want to combine the feature created here with other election metrics including polls. This would be another significant undertaking to figure out how to reliably aggregate results, but this may offer orthogonal insights to methods that are currently used to predict election results.

Check out the code for the project here.