# Optimization Functions in Deep Learning

This will be a short post about optimization functions in deep learning. For simplicity, I'll show how these optimization functions work with a simple 1D example. Suppose that the loss function, $L,$ as a function of weight, $w,$ looks like the following.

The flat region for values of the weight between 1 and 2 is a little contrived, but this will help depict some of the advantages of the more complicated optimization functions. Saddle points are ubiquitous in a deep learning training landscape, so these are valid concerns to consider.

With the standard gradient descent described in a previous post, the optimization needs to entirely avoid the flat region. In this region, the gradient is zero, and thus no update to the weights can be made. As shown below, for positive starting values of the weight, it is hard to avoid the zero gradient region. Even a negative starting value can end up in the flat region if the learning rate is large enough.

Loss as a function of optimization steps starting with a weight of -10 (top) and a weight of 10 (bottom) for different learning rates. Note that the loss flattening out at $10^0$ is an artifact of the flat region of the loss function.

Decaying the learning rate as the optimization time increases also doesn't really help here. The biggest problem is the algorithm will get stuck in the flat region, and changing the learning rate won't fix this issue.

Momentum

This example shows that the crux of the problem is that the update goes straight to zero when a flat region is encountered. The idea behind momentum is to track a momentum variable that updates with each step.

In these expressions, a subscript denotes the vector component to allow the expressions to generalize to multiple dimensions. There is now a new parameter, $\alpha,$ that tracks how much memory of previous gradients the optimization process should have. Note that when $\alpha = 0,$ this reduces to standard gradient descent.

Implementing this on the function above, we see now that as long as the momentum parameter is large enough, the flat region can be avoided, even with the same learning rate as before. On top of that, the convergence is far faster than any of the choices above.

Loss as a function of optimization steps starting with a weight of 10 for different momentum parameters.

RMSprop

Typical deep learning models have many orders of magnitude more parameters (100000+). Suppose that during the training process, the gradients start out large for some of these variables but not others. There will be a significant update for those variables with large gradients. The gradients of the other parameters might become larger as the training proceeds. However, if the learning rate is decaying through the training process, these updates will not change the parameters as much as if these large gradients had appeared earlier in the training process.

Thus, RMSprop attempts to tune the learning rate for each parameter. If the gradient over the previous steps has been large for that parameter, the learning rate for that parameter will be smaller. This allows for smaller steps to refine the value of the parameter, as discussed in a previous post. Conversely, if the gradient over the last few steps has been small, we may want a larger learning rate to be able to move more in that parameter. We don't want the memory of the gradient sizes to be too long since it seems perverse for the first gradients to affect the 100000th step of training. Thus, a moving average is used to keep track of the gradient magnitudes. The updates of RMSprop are the following.

We see that $\sqrt{r_i}$ is a moving average of the gradient magnitude for each parameter. The moving average is characterized by the parameter $\rho.$ $\delta$ is a parameter chosen such that denominator in the fraction does not diverge. $\eta,$ as before is the learning rate of the problem. Global learning rate decay can also be combined with this method to ensure learning rates for each parameter are bounded.

Since the update is always directly proportional to the gradient, for the problem at hand, RMSprop doesn't help and will get stuck much like traditional gradient descent.

If momentum works well, and RMSprop works well, then why not combine both? This is the idea behind Adam. The gradient descent updates of Adam are the following.

The two moving averages, corresponding to the momentum and gradient magnitude, are tracked and used for the update. Adam is my go-to optimizer when training deep learning models. Typically the only parameter that needs tuning is the learning rate, $\eta,$ and the default values of the other parameters ($\delta=10^{-8},\alpha=0.9,\rho=0.999$) tend to work really well. I applied Adam to the 1D example loss function with these default parameters and got some decent results. This looks a little worse than momentum, but this is because this problem is so simple. For most real-world problems Adam outperforms momentum.

Loss as a function of optimization steps using Adam optimizer with default parameters starting with a weight of 10 for different learning rates.

# Who will win Top Chef Season 15?

Warning: Spoilers ahead if you have not seen the first three episodes of the new season

Since I was very wrong last year about who would win Top Chef, I decided to take another stab at it.

This time, I approached it slightly differently. I predicted the win probability after each episode rather than each chef's ultimate placement. I did this since there are not that many relevant features to predict the placement. When I was validating the model on season 14, I found that the model was overfitting when using the location features (e.g., from California), so I decided to leave these features out as well.

The other big change was I decided to use an RNN model to process the sequence of episode performance for each chef. This way, I could train the model on all data, not just data from the first episode, but still, gain useful insights for the first episode.

The output of the RNN was then combined with features for gender, and whether the season had last chance kitchen to get the final prediction. I validated this with season 14. I tuned the training parameters to try to give Brooke the best chance of winning, though it was still challenging as she was eliminated.

Unfortunately, even with the last chance kitchen feature, the model is having a hard time accounting for chefs coming back. This isn't too surprising considering Kristen is the only one who won last chance kitchen and ended up winning the whole thing (until season 14). This is less relevant for predictions after the first two episodes, as a chef eliminated in the first two episodes is extremely unlikely to come back through last chance kitchen. The predictions after the first episode aren't too different from last year's model. Casey is the favorite, though Sheldon comes in as second instead of Katsuji.

After choosing optimal parameters using season 14, I use all of seasons 1-14 to train a model to predict on season 15. With this, after the first episode, I get the following win probabilities for each chef.

As expected, the winner of the first challenge (Tyler) is the favorite to win. I was delighted to see Tu come in second since I've been to his amazing popups and would love to see him win.

It's taken me a while to actually write this post, and since then, Claudette, Laura, and Rogelio have been eliminated, who were all in the 4% chance of winning crowd, so the model seems to be holding up so far!

# What does Kappa mean?

I've written about some basic NLP on twitch chats before. This post is an extension of that, with more data, more sophisticated methods, and hopefully better results!

Word2vec is an algorithm that takes a one-hot encoding of words, which is a sparse, high dimensional representation, and maps it to a dense, low dimensional representation. In this low dimensional word embedding, the distance metric between words represents how likely words are to appear near each other in a sentence. This results in synonyms and related words being close to each other. In fact, misspelled words (which are common in twitch chats) are often the closest synonym for words. The low dimensional representation is learned by training a shallow neural network to predict a word given the words around it. This has some intriguing properties like being good at analogies (which will be discussed later).

So, I took a bunch of twitch chats I could find and trained a word2vec model on it. This was from chats of about 360 streamers over the past four years. Unfortunately, this isn't the most unbiased source of data. Clearly, larger streams have more chatters, and thus their chats will be overrepresented. In addition, the 360 streamers are a small fraction of all streamers on twitch. In fact, none of the streamers I regularly watch had available chat logs.

I did some cleaning and took out single word messages as well as anything twitchnotify and bots said ¹. Even if a chat is actually multiple sentences, I assume that each chat message is a sentence for the word2vec training. I also strip all symbols in messages to ignore any punctuation. This does have the downside of making words such as <3 and 3 equivalent. I chose a context size of 2, and require a word to appear more than 50 times in the corpus to train the word2vec model. I train a 200-dimensional word2vec model.

Once the word2vec is trained, the cosine distance between word vectors can be used to determine their similarity. This showed that the word closest to Kappa was, unsurprisingly, Keepo. This is followed by 4Head and DansGame; also twitch emotes. The closest non-emote word closest to Kappa was lol. This was unsatisfying for me because I feel like information is lost in translation from equating Kappa with lol, but it makes sense. Kappa is likely to appear at the end of sarcastic sentences, and it's quite reasonable for lol to occur in a similar context.

I then looked into analogies. Word2vec has a cool property that by adding and subtracting word vectors, a relationship between two words can be added to another word. The textbook example of this is man+queen-woman=king. That is, the relationship of the difference between queen and woman (royalty) is being added to man to get king. My word2vec model did, in fact, learn this relationship. With some of the game-related analogies, the model fares a bit worse. The expected analogy result was not necessary the closest vector to the vector sums but would be one of the closest.

The top three closest word vectors to the vector sum (or difference) of word vectors shown. This shows that the models may not learn relationships between words entirely, but is developing a pretty good idea.

Next, I plotted how words were distributed globally in the word embedding. I used PCA to reduce the 200-dimensional word embedding to 2 dimensions to visualize the relationships. What this showed was that foreign words cluster separately from English words. This makes sense, as it should be rare to combine German words with English ones in the same sentence. Another effect was the commonly used words clustered together, and then there is a region of context-specific words and meme words. Sub emotes are an example of context-specific words, as they are likely to be found only in the chat of one streamer, in which similar chat topics are present. A meme word would be something like the "word" AAAAE-A-A-I-A-U- that usually only appears with the Brain Power meme, and are unlikely to show up in any other context.

The word embedding of all the words in the corpus, with PCA used to reduce the dimensions from 200 to 2. Each dot represents a word. Natural clusters form in the word embedding.

Zooming into the common words area, the relationships between words become apparent. Most of the global emotes are toward the lower half of the common words range, while games sit on the top half. TriHard is closer to the left, getting close to the context-specific range, which makes sense as while TriHard is a global emote, it's probably used more in TriHex's chat. The politicians cluster together, with Obama closer to Clinton than Sanders or Trump.

Zooming into the common words part of the previous graph, and visualizing some of the words.

With the success of vector representation of words, a natural extension is a vector representation of chat messages. This can be a useful way to classify similar sentiments or intents between chatters. A simple way to get a vector representation of sentences is to take an average of the word vectors in the sentence in a bag of words approach, ignoring any words that do not have a vector representation. This is a commutative operation, so it is not a perfect representation of a sentence. For example, the sentences "the cat big the dog" and "the dog bit the cat" have different meanings.  However, this is a good starting point to capture the overall intent of sentences.

I cluster the sentences to determine relationships between them. I sampled a set of 1,000,000 chat messages as this was about as much my computer could handle. Typically, I would use DBSCAN to cluster, but as this was computationally prohibitive, I opted for minibatch k-means. I chose the cluster size as 15, and 6 large clusters emerged from the clustering, shown below.

The sentence vectors of 1,000,000 chat messages, with PCA used to reduce the dimensions from 200 to 2. Each dot now represents a chat message, calculated as a sum of the words vectors in the chat message. The different colors represent clusters found by K-means.

Like the words, sentences containing foreign phrases cluster separately from the rest of sentences. Likewise, chats with sub emotes, and channel specific memes tend to cluster together. Spamming of global emotes was another cluster, and there is reasonably another cluster where global emotes are combined with text chat messages. Chat messages without emotes tend to cluster into two regions: one where the chatter is interacting with the streamer, and another where the chatter is talking about themselves or referencing others in chat (that is the personal pronouns category). These are general trends, and six clusters are not enough to capture all intents of chatters, but this gives a broad idea.

As mentioned earlier, in the context of twitch, this wasn't using that much available data. I'd expect that training on more data might help the model better learn some of the analogies. Another intriguing prospect is how the word embeddings change. I'm sure relationships between words like Clinton and Trump evolved over the course of last year's election. This raises interesting questions about what time period that word2vec should use as a training corpus.

Code for this post is available here.

1. I define bots as tags containing the string 'bot.'

# Autoencoders

I'm trying to collect some notes on machine learning topics, and so I'm creating a new series of tutorials on this blog. The first subject is autoencoders. Basic autoencoders are simple neural networks, making them a great platform to discuss basics of neural networks. Despite the simplicity, autoencoders have utility in their own right, making them more exciting than a toy example.

Dimensionality Reduction
Autoencoders are a type of dimensionality reduction technique. Dimensionality reduction is used to project vectors in a large space to a smaller space, ideally losing little information in the smaller space. This can be used, for example, for compression, feature extraction, and visualization. In fact, I've used PCA to visualize high dimensional vectors before in this blog.

Depending on the desired properties of the transformation, there are multiple ways to reduce vector dimensions. Common ways I typically use are Johnson-Lindenstrauss, PCA, and t-SNE. My go-to dimensionality reduction algorithm is PCA, so I'll review the general idea behind PCA to offer a comparison point with autoencoders.

Assume that I have a matrix, $Z,$ with dimensions the output vector size by the input vector size. $Z$ will be the decoding matrix. While a decoder is not necessary to encode vectors to the reduced dimension, it will help characterize how much information is lost in the dimensional reduction process. To simplify the problem, I assume that the columns of $Z$ are orthogonal, that is, $Z^TZ = I$ (but $ZZ^T\neq I$). Let $W$ be the encoding matrix with dimensions input vector size by output vector size. No assumptions will be made on $W$ at this point.

Consider a set of $\mathcal{N}$ input vectors, $x^{(n)}$ for $n\in\{1,...\mathcal{N}\},$ that will be projected to a lower dimensional space. The vector, $\chi^{(n)}$ in the lower dimensional space is

A faithful representation of the collection $\{x^{(n)}\}$ in the lower dimension will minimize the reconstruction accuracy,

The smaller this reconstruction is, the less information that is lost by reducing the dimensions of the input vectors. Let $X$ be the collection of all the $\{x^{(n)}\}.$ Then, the loss can be rewritten as

Expanding gives

I used the fact $Z^TZ=I$ to simplify the last term. To minimize the reconstruction error, take the partial derivative with respect to $W$ to get

Setting to zero, I find the optimal solution is $W=Z^T.$ Hence, the optimal solution has the encoder as a transpose of the decoder. Plugging this in, the loss is

Thus, the loss is related to how well the projection matrix $ZZ^T$ maps $XX^T$ to itself. This is of course, non-trivial as $ZZ^T$ is not full rank. Thus, $L$ is minimized when $ZZ^T$ is the best low-rank approximation of $XX^T.$ This is a property of the SVD. Hence, the SVD of $XX^T,$ keeping only the output dimension size of singular values is the optimal dimensionality reduction method, given the assumptions.

Neural Networks
An autoencoder is one way that neural networks can be used as a dimensionality reduction technique. The idea of an autoencoder is to have a neural network learn the identity function. That is, the loss function being optimized is the $L^2$ distance between the network prediction and the input vector. Note that this is the same loss function as used for PCA. Again, in this case, learning the identity is not trivial as the autoencoder has a hidden layer with much fewer neurons that the input layer. This forces the model to learn to efficiently represent the input vectors with fewer parameters. To actually get an autoencoder, the output layer is stripped so that the learned low dimensional representation can be utilized.

To make this a bit more concrete, consider the problem of reducing 200-dimensional vectors to 2 dimensions. The network to train the autoencoder would look like the image below.

The architecture of the autoencoder neural network. Each circle on the image represents a neuron. Each of the 200 elements of the input vector $x$ is a neuron on the left side of the image. Each of the two neurons in the center of the image represents a vector of weights. The neuron outputs the dot product of those weights with the input vector and adds a bias. $W^{(1)}$ is the collection of weights from both of these neurons. Typically, the output layer is the desired output of a neural network, but for an autoencoder, the desired output is the hidden layer outputs. The 200 neurons on the right are the output neurons. These also represent a dot product and bias-add operation, with the weights stored in the matrix $W^{(2)}$. The output neurons results are used to train the network. The output of each layer is represented in $o^{(l)}.$

Each (non-input) neuron represents a dot product of the weights of that neuron and its inputs, plus a bias. For example, for the hidden layer, this is $w\cdot x+b,$ with $w\in\mathbb{R}^{200},$ the vector of neuron weights, $x\in\mathbb{R}^{200},$ the input vector, and $b\in\mathbb{R},$ the neuron bias. Typically, an activation function¹ is applied to the output of the neuron, but here I will assume a linear activation function for simplicity. The network above has 402 tunable parameters in its hidden layer and 600 parameters in its output layer for a total of 1002 parameters. The goal of training is to choose these parameters such that the network is an effective autoencoder.

Again, consider a set of $\mathcal{N}$ input vectors, $x^{(n)}$ for $x\in\{1,...\mathcal{N}\},$ that will be projected to a lower dimensional space. I will use notation where subscripts denote indices to the elements of the matrix or vector. Superscripts denote the layer number for weights and outputs ($W$ and $o$), while they index the training data for data vectors ($x$ and $y$). Explicitly, in terms of each input vector $x^{(n)}$ and each output vector $o^{(2)},$ the loss function is

The factor of $1/\mathcal{N}$ in front makes losses comparable for differently sized training sets. As with PCA, I wish to minimize the loss with respect to the weights.

Once trained, the autoencoded low dimensional representation $\chi$ for a vector $x$ is

Note the similarity of this expression to PCA. An autoencoder can be seen as a generalization of PCA in which the restrictions on orthonormal columns of the decoding matrix are lifted.

Unlike PCA, simple calculus cannot be used to obtain the weights that optimize the loss. Thus, I use gradient descent. The gradient of the loss with respect to the weights ($W^{(1)},$ $W^{(2)},$ $b^{(1)},$ and $b^{(2)}$) gives, locally, the direction of greatest descent in the loss function. Thus, by updating the weights along the gradient, the loss function will move closer to its optimum.

Here, $\eta$ is the learning rate. If the learning rate is too small, the progress toward the minimum is slow, and thus training takes too long. However, if the learning rate is too large, the steps taken can escape a regime where a linear approximation to the loss may be inappropriate. This can result in the training not converging. Tuning an optimal learning rate is thus important to ensure effective training. In training, the value of $\eta$ is typically decayed so that the training takes large steps toward the optimum initially, but has a small learning rate when the training is refining the weights near the optimum.

Typical behavior of loss as a function of training steps for different values of the learning rate (assuming no learning rate decay). If the learning rate is too high, the loss can diverge. On the other hand, if the learning rate is too low, training will take a long time.

For a simple case of an autoencoder, it is not too hard to explicitly calculate the update for each of the 1002 weights. However, this introduces inefficiencies in the training as the number of layers increases. Consider an entry of the weight matrix $W^{(l)}_{ij}.$ The entry $W^{(l)}_{ij}$ only multiples the $j$th component of the output of the previous layer, $o^{(l-1)}_j,$ and no other outputs of the previous layer. All dependencies at later layers are entirely contained in the outputs of this layer, $o_i^{(l)}.$ Thus,

The layer outputs, $o^{(l-1)}_j$ can be cached during the forward pass when the loss functions is calculated. Using the chain rule, I can simplify $\partial L/\partial o_i^{(l)}$ further to get

Here, $|l|$ denotes the number of neurons in the $l$th layer. This means that the update rule for layer $l$ only depends on pre-computed outputs and weight updates from later layers. By starting at the output layer and backpropagating, all the information is available to compute updates for weights at each layer, as long as the results from the forward and backward propagation are cached. This allows for efficient calculation of weight updates.

The loss function is an average of the RMS loss over all input vectors. By the central limit theorem, as long as input vectors are chosen randomly, a good approximation to the average can be computed without using all the input vectors. Significant time can be saved by updating with these batch gradients. Denote the batch size as $\beta.$ A training epoch consists of calculating $\mathcal{N}/\beta$ gradients on a random ordering of the training set. Conditioned on $\mathcal{N}\gg\beta,$ choosing the batch size as large as the machine memory will allow yields the fastest convergence. In fact, many distributed neural network training systems achieve speedups by increasing the batch size. When memory permits for a $\beta$ that is a sizeable fraction of $\mathcal{N}$ (and getting more data is not an option), the batch size should be tuned to optimize the training.

The autoencoder described here is the most trivial autoencoder. Autoencoders with more hidden layers and other network components are used, but these are beyond the scope of this post.

1. Commonly, relu is used. Unless the network performance needs to be highly optimized, this is usually sufficient to obtain good performance with quick training. In theory, any nonlinear function should be able to achieve good performance.

# Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really I read almost every article written), and I think they do a great job of interpreting poll results. However, I do notice that the process in which they compile the results is tedious. Some of the tediousness, like collecting poll results, is inevitable. However, other work including figuring out which polls are reliable and how to interpret polls across different states seems like it might be automatable. I took a stab at this automation using data that fivethirtyeight compiled, which has gathered polling data for various state and national elections three weeks before the election date. This was about 6200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but what kind of moving average taken is up for debate.
2) How to decide which polls are good. Dishonest polls are a problem. If a pollster is partisan, there needs to be a way to take into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll if it disagrees with the conventional wisdom, this can introduce bias. There is also uncertainty about how undecided voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue was tricky, and will not be covered here, but the first three issues are discussed in this post.

I tackle the problem as a time series prediction problem. That is, given a time series (when the polls happen and their results) I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to solve the problem. RNNs even handle different-length sequences quite naturally, which is a plus as this is awkward to encode into useful input for other types of models like a tree-based model.

I use the data before 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, instead of whole elections (or the entire presidential race, which is equally complex). For each poll, I compile a set of information about the polls:
• How each candidate in the race polled, in order of democrat, republican, and highest polling third party candidate, if available. The ordering is relevant as I account for the political preferences of polls.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• Of polls conducted before the poll, the percentile of the number of polls the pollster has conducted relative to all other pollsters. The intuition here is that agencies that do one poll are not very reliable, whereas agencies that have done many polls probably are.

All of this information is collected into a single vector, which I will call the poll information. I then take all polls of the same race previous to that poll and make an ordered list of the poll informations, which is the sequence that is the input to the neural network model.

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any sense of "year" as input to the neural network as I wish the model to extrapolate on this variable and hence I do not want the model to overfit to any trends there may be in this variable. I use three LSTM layers with 256 units followed by two fully connected layers with 512 neurons. The output layer is one neuron that predicts the final margin of the election. Mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initialize by transfer learning from an exponentially weighted moving average.

I use dropout at time of prediction as a way to get an estimate of the error in the output of the model. The range where 90% of predictions lie using different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on a training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 election). I find the percentile of the ground truth election result within the Monte Carlo model predictions. Thus, a perfectly calibrated model would have a uniform distribution of the percentile of ground truth election results within the Monte Carlo model predictions. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic with the uniform distribution. This turned out to be 40%, which was comforting as this is a typical choice for dropout at training.

A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data before 2016 (thus using both the training and validation set). I take the calibrated dropout rate and again get Monte Carlo samples for polls, using the newly trained model, on the test set. I then count the number of positive-margin Monte Carlo outputs to obtain a probability that the election swings in favor of the first party.

Applying this to a Senate race, I find that the uncertainty in election margin is sizable, around 15%. This is comparable to uncertainties obtained through other aggregation methods, but this shows that it is tough to accurately call close races just based on polls.

Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available that day, for each state I run 1000 predictions with different RNG seeds. For each of these 1000 predictions, I add up the electoral votes the candidate would win if they had the predicted margins. The probability of the candidate winning is then the percentage of these outcomes that is below 270.

Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republic victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.

The probability of Clinton winning the 2016 election predicted by the model as a function of days before the election.

Unless the 2016 election was a rare event, this shows that clearly, the model is incomplete. Relationships between how states vote compared to polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively, and in particular, better discern which pollsters are reliable. More features, such as whether the incumbent president is running or if it is an off-year election may also add more information in the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.

1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or more than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the primary contributor to the KS test statistic, so it is ignored here.

# Predicting Elections from Pictures

This work was done by the fantastic team I mentored during the CDIPS data science workshop.

I got the idea for this project after reading Subliminal by Leonard Mlodinow. That book cited research suggesting that when people are asked to rate pictures of people based on competency, the average competency score of a candidate is predictive of whether the candidate will win or not. The predictions are correct about 70% of the time for senators and 60% for house members, so while not a reliable indicator, there seems to be some correlation between appearance and winning Senate races. So, we decided to use machine learning to build a model to assess senator faces in the same manner.

This is a challenge as there are only around 30 Senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old since as trends in hair and fashion change, these probably affect how people perceive competence of people. We ended up using elections from 2007-2014 as training data to predict on the 2016 election. We got the senator images from Wikipedia and Google image search. We used images for the top two finishers, which is usually a democrat and republican. We didn't include other elections since the images were less readily available and we aren't sure if appearing senatorial is the same as appearing presidential (more on that later).

Interpreting Faces
We use a neural network to learn the relationships between pictures and likelihood of winning elections. As input, we provide a senator image with the label of whether the candidate won their race or not. Note that this means that our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the election (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To assess the winner of a particular election, we compare the probability of the two candidates and assume the candidate with the higher probability will win.

To cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers followed by max-pooling layers. After the convolutional layers, we used a fully connected layer before outputting the election win probabilities. Even with these simple networks, there are millions of parameters that must be constrained, which will result in overfitting with the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images to increase the number of training images to make the training more robust. We also explored transfer learning, where we train the model using a similar problem with more data, and use that as a base network to train the senator model on.

We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).

Model Results
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on senator images (with the image modifications).
(II) Transfer on senator images from male/female classifier trained on faces in the wild.
(III) Transfer on face images from vgg face (this is a much deeper network than the first two).
We compare and contrast each of these approaches to the problem.

The accuracy in predicting the winners in each state in 2016 for each model were respectively (I) 82%, (II) 74%, and (III) 85%. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of Senate races in 2016 is small, so these results come with a lot of uncertainty. Further, the training and validation sets are not independent. Many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.

The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.

Validating Results
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.

Scores predicted by each of the models for different pictures of the same candidate. Each row is the prediction of each of the three models.

All of our models have some variability in predictions on pictures of the same candidate so our model may benefit from learning on more varied pictures of candidates. We have to be careful, though, as lesser-known candidates will have fewer pictures and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest score among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos and the model may be catching this trait.

We also looked at what features the model learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the trustworthiness of a candidate.

Scores predicted by each of the models for pictures where part of the image is masked. Each row is the prediction of each of the three models.

We find masking the images wildly changes the prediction for models (I) and (II) but not for model (III). It seems that for this model apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.

We also compared what the first convolutional layer in each of the models learned.

Samples of the first convolutional layer activations after passing in the image on the far left as visualized by the quiver package. Each row shows the output for each of the three models. We see that each model picks up on similar features in the image.

This confirms that apparel is quite significant, with the candidate's tie and suit being picked up by each of the models. The models do also pick up on some of the edges in facial features as well. A close inspection of the layer output shows that the result of model (III) is cleaner than the other two models in picking up these features.

Given all of these findings, we determine the most robust model is model (III), which was the model that predicted the most 2016 elections correctly as well.

Conclusion
Earlier, we mentioned we trained on senator data because we were not sure whether other elections had similar relationships between face and winning. We tested this hypothesis on the last three presidential elections. This is a limited data set, but we find the model predicts only one of the elections correctly. Since presidential elections are so rare, training a model to predict on presidents is a challenge.

Model (III) predictions on presidential candidates.

Our models were trained to give a general probability of winning an election. We ignore the fact that senator elections, for the most part, are head to head. There may be benefits from training models to consider the two candidates running for the election and having the model choose the winner. Ultimately, we would want to combine the feature created here with other election metrics including polls. This would be another significant undertaking to figure out how to reliably aggregate results, but this may offer orthogonal insights to methods that are currently used to predict election results.

Check out the code for the project here.

For some background, there is no way I would have known in college that I would not have wanted to apply for postdocs after my Ph.D. My career plan then was to go through grad school, be successful, and someday end up a professor. During my Masters' degree, I realized that this may not be the life I wanted. My advisers at the time (who were married to each other) would routinely be at the institute until 8 or 9 in the evening and would come in early in the morning. I often wondered if they discussed much else than physics at home. During my Ph.D., I also observed my advisers working long hours and working on weekends and holidays. Once my advisor even told me that anyone I (a physicist) dated should expect for me to be unavailable on weekends and holidays when there was important physics to do. Still, I was in too deep at this point, so I determined the minimal amount of work I needed to get a Ph.D. and completed that. Now that I have a job as a data scientist, the benefits of my Ph.D. seem to be the people I met and the connections I made (which did help me get the job), but the actual knowledge I gained during the degree has been mostly useless. Looking back, I'm reminded of some of the critical issues with UC Berkeley and academia and have outlined them here.

For some reason, during grad school, you are expected to volunteer your time, with no pay or credit. This is especially apparent during the summer when my contract said I was supposed to work 19 hours/week, but my adviser expected me to come in 40+ hours/week. What also shocked me is when I told fellow grad students about this issue, they were not even aware they were only supposed to be working halftime. Further, if an adviser does not have funding and the student teaches during summer to cover costs, the student has no obligation to do research during that summer, but this isn't communicated to the student. These facts are rarely spelled out. The sad thing is, this doesn't end with grad school. I know post-docs (at my institution and others) who have told me that their contracts say they should work around 40 hours/week, but are routinely actually expected to work 50-60 hours/week.

My last semester in grad school I was not enrolled in classes, not getting paid for research and was just working on my thesis. Hence, there was no obligation for me to do anything that was not for my benefit. Still, my advisers tried to guilt me into coming into the office often (with threats of not signing my thesis) and to continue doing work. It still bothers me I paid the university (for full disclosure, I made plenty of money during a summer internship to cover these costs) for the "opportunity" to work for the research group for that semester.

When research is done for course credit, the lines are blurred a bit. The time spent on the "course" is not necessarily fixed. I would think that if it is a "course," then research obligations related to the course should start when the semester starts and end when the semester ends. Certainly, the "course" should not require a student to attend a meeting on a university holiday or weekend (which happened to me during grad school). Most universities have standards and expect professors teaching courses to be available for their students. Some graduate students I know talk to their advisors a couple of times a semester which I would think hardly respects these standards. This has also led me to question what can and cannot be asked of a student in a course. For example, if there were a "course" in T-shirt making that made students work in sweat-shop like conditions to get a grade in a class, would this be legal? While not a tangible product, research for credit is a somewhat similar scenario where the students are producing papers that will ultimately benefit the professor's fame (and a small chance of advancing the student's career).

Another issue is that, as budgets get tighter, expenses get passed off to students. During my time at Berkeley, keeping pens and paper in a storeroom was deemed too expensive. The department suggested each research group make their own purchases of these items through the purchasing website. Not only is this a waste of graduate student's time, but the website was so terrible that often it was easier just to buy items and not get reimbursed for them. Most research is done on student's personal laptops, and while a necessity to continue work, rarely is their support from the university to make this purchase or pay for maintenance when it is used for research work. There is no IT staff, so again it falls on students and post-docs to waste their time dealing with network issues and computer outages rather than focusing on the work that is actually interesting.

I apparently have a roughly 50% chance of "making it" as a professor, which is mostly because I attended a good institution and published in a journal with a high impact factor. Ph.D. exit surveys have found similar rates for the fraction of students that stay in academia. Yet, the general expectation in academia is that all of the students will go on to have a research-focused career (I know some professors who look down upon a teaching-focused career as well even though this is still technically academia).

Even after making it extremely clear to my adviser that I had no intentions of pursuing a postdoc, he told me that I should think about applying. He went so far as to say data science (my chosen profession) was a fad and would probably die out in a few years. Once during his class, he seemed quite proud of the fact that between industry and academic jobs, most of his students had wound up staying in physics. While my adviser wasn't too unhappy with me taking courses unrelated to research, many advisers will actively encourage their students to focus on research and not take classes. This is terrible advice, considering useful skills in computer science and math are often crucial to getting jobs outside of academia.

The physics curriculum, in general, is flawed. At no point in a typical undergraduate/graduate curriculum are there courses on asymptotic analysis, algorithms, numerical methods, or rigorous statistics, which are all useful both inside and outside of physics. Often these are assumed known or trivial, yet this gives physicists a poor foundation and can lead to problems when working on relevant problems. I took classes on all of these topics, though they were optional, and they are proving to be more useful to me than most of the physics courses I took or even research I conducted as a graduate student.

This is a problem at the institutional level as well. For physics graduate students at UC Berkeley, the qualifying exam is set up to test students on topics of the student's choosing. I chose numerical methods and statistics as my main topic, but my committee asked me no questions on numerical methods and statistics. To be fair, one member tried, but he didn't know what to ask, but then again, he could have prepared something to ask since the topics are announced months before the exam. Instead, my committee asked me general plasma physics and quantum mechanics questions which were not topics I had chosen. I failed to answer those questions (and have even less ability to solve it now), but somehow still passed the exam. This convinced me that the exam was nothing more than a formality, yet no one was willing to make changes to make the exam more useful. An easy fix would be to frame the oral exam as interview practice, as most graduate students have no experience with interviews.

There are many student-led efforts to try to make transitioning into a non-academic job easier at UC Berkeley. But the issue is, for the most part, they are student-led and have little faculty support. I was very involved in these groups, and it was quite clear that my adviser did not want me to be involved. Considering that involvement is why I have a job right now, I would say I made the right choice.

As I alluded to before, even though I was receiving no money or course credit from my research group in my last semester, my advisers threatened to not sign my thesis unless I completed various research tasks (some unrelated to the actual thesis). I've talked to others that have had similar experiences, and I hate to say I'm confident that this extends beyond research tasks in some research groups. This could be solved by making the thesis review process anonymous and have the advisers take no part in it, but there seem to be no efforts to make this happen.

Another fault is that advisers can get rid of students on a whim. I've known people who have sunk three years into a research group only to be told that they cannot continue. Then, the student has to make a decision to start from zero and spend a ridiculous amount of their life in grad school or leave without a degree rendering the three years in the research group irrelevant. Because of this power, professors can make their students work long hours and come in on weekends and holidays. If the student does not oblige, the time the student has already put into the research group is just wasted time.

It doesn't help that research advisers are usually also respected members of their research area. That is, if a student decides to stay in academia (particularly in the same research area as grad school), the word of their adviser could make or break their career. This again gives an opportunity for advisers to request favors from students.

Further, the university has every motive to protect professors, especially those with tenure, but not graduate students. This became embarrassingly clear, for example, in how the university handles Geoff Marcy before Buzzfeed got a hold of the news. If professors can ignore rules set by the University (and sometimes laws) with little repercussion, there is little faith graduate students can have that new rules the university creates to any of the problems mentioned here will be followed. I don't want to belittle the (mostly student-led) efforts to make sexual harassment less of an issue at UC Berkeley, but ultimately the only change I can pinpoint is that there is now more sexual harassment training, which has been shown by research at UC Berkeley to lead to more sexual harassment incidents.

Ultimately, graduate school and a career in academia work for some people. Some people are passionate about science and love working on their problems, even if that means making a few sacrifices. Progression of science is a noble, necessary goal, and I am glad there are people out there to make it happen. My hope is that many of the problems mentioned here can be rectified so that the experience for those people and also the people who realize that academia isn't for them can succeed on a different path.

I've been reading a lot during my commutes this year, and I thought I'd summarize some of my thoughts about the books and blogs I've read and how much I enjoyed them.

Subliminal: How Your Unconscious Mind Rules Your Behavior by Leonard Mlodinow - 4/5
This was a fun read. Lots of examples of how unconscious decisions are actually more prevalent than most people realize. This was particularly interesting as I've been thinking more about unconscious bias lately. This book got me thinking about data-driven approaches to quantify bias (both conscious and unconscious), but it is tricky to define the correct loss function to train this model.

But What If We're Wrong? by Chuck Klosterman - 2/5
Not sure if I got the point of this book (to be fair, the book did warn me that this might happen at the beginning). I thought the book was about politics (the back of the book mentions the president), but what I remember of it was mostly about pop culture and philosophy. Still, it's good to be reminded periodically to try to think how others feel in a two-sided political situation.

Dune by Frank Herbert - 1/5
Couldn't really get into this book. I thought the world was not particularly exciting and apart from a few events early on, I found the story a bit slow.

Data for the People by Andreas Weigend - 5/5
Full disclosure, Andreas is a friend of mine, so it's hard for me to be objective about this book. I think Andreas discusses provocative ideas on the trade that users have with a company, that is giving up some of their privacy in exchange for better services from the company. I worry though that the ideas are hard to implement without an external body to enforce it. There are many fun stories about data to tie all of these ideas together.

Hillbilly Elegy by J.D. Vance - 3/5
As the author of this book is a successful lawyer who grew up relatively poor in the rust belt, the author is the ideal person to translate the ideals of the working class to city dwellers. This book is mostly about the life of the author but touches on religion, family values, and the hillbilly lifestyle that contributed to his worldview. I felt like this definitely helps contextualize some of the voting patterns that I see in the U.S.

Weapons of Math Destruction by Cathy O'Neil - 4/5
Definitely some useful lessons on how defining metrics is vital and how, without retraining, models can get gamed and not serve their intended purpose. I feel like there were a lot of ideas and guidelines for preventing dangerous models, but without an enforcement mechanism, I'm unsure if any of the ideas can come to fruition. I'm always advocating for data-driven solutions to problems, but this book has made me consider that models have to be cautious, especially when biases can be involved.

Wait but Why Blog by Tim Urban - 4/5
This blog offers in-depth analyses of tech ideas like AI and Elon Musk's companies. The author goes deeply into the details about each topic to provide a comprehensive view of the topic. The only downside is that sometimes I feel like all the criticisms of the topic are not adequately addressed.

# Who will win Top Chef Season 14?

Warning: Spoilers ahead if you have not seen the first two episodes of the new season

In the first episode of the season Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has won the first quickfire and gone on to be top chef (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9 when a whole team won the elimination challenge). This got me wondering if there were other predictors as to who would win Top Chef.

There's not too much data after the first elimination challenge, but I tried building a predictive model using the chef's gender, age, quickfire and elimination performance, and current residence (though I ultimately selected the most predictive features from the list). I used this data as features with a target variable of elimination number to build a gradient-boosted decision tree model to predict when the chefs this season would be eliminated. I validated the model with seasons 12 and 13 and then applied the model to season 14. I looked at the total distance between the predicted and actual placings of the contestants as the metric to optimize during validation. The model predicted both of these seasons correctly, but seasons 12 and 13 were two seasons where the winner of the first elimination challenge became top chef.

The most significant features in predicting the winner were: elimination challenge 1 performance, season (catching general trends across seasons), gender, home state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Chicago. To try to visualize this better, I used these significant features and performed a PCA to plot the data in two dimensions. This shows how data clusters, without any knowledge of the ultimate placement of the contestants.

A plot of the PCA components using the key identified features. The colors represent the ultimate position of the contestants. Blue represents more successful contestants where red represents less successful contestants. The $x$ direction corresponds mostly to first elimination success (with more successful contestants on the right), and the $y$ direction corresponds primarily to gender (with male on top). The smaller spreads indicate the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is an absolute deal-breaker to become top chef, but of course, winning the first challenge puts contestants in a better position.

My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.

Of course, there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about adding but didn't was what the chefs actually cooked. Specific ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g., season 11 was probably the only time turtle was cooked in an elimination challenge). With more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.

The code is available here.

# Election Thoughts

This election was anomalous in many ways. The approval ratings of both candidates were historically low. Perhaps related, third-party candidates were garnering much more support than usual. The nationwide polling of Gary Johnson was close to 5% and Evan McMullin was polling close to 30% in Utah. There's never really been a candidate without a political history who has gotten the presidential nomination of a major party and there's never been a female candidate who has gotten the presidential nomination of a major party.

These anomalies certainly make statistical predictions more difficult. We'd expect that a candidate might perform similarly to past candidates with similar approval, similar ideologies, or similar polling trends, but there were no similar candidates. We have to assume that the trends that carried over in past, very different elections apply to this one, and presumably, this is why so many of the election prediction models were misguided.

I have a few thoughts I wanted to write out. I am in the process of collecting more data to do a more complete analysis.

Did Gary Johnson ruin the election?
No. In fact, evidence points to Johnson helping Clinton, not hurting her. Looking at the predictions and results in many of the key states (e.g. Pennsylvania, Michigan, Florida, New Hampshire; Wisconsin was a notable exception) Clinton underperformed slightly compared to the expectation, but the far greater effect was that Trump overperformed and Johnson underperformed compared to expectation. This is a pretty good indicator that those who said they'd vote for Johnson ultimately ended up voting for Trump. There seems to be some notion that people were embarrassed to admit they'd vote for Trump in polls. This might be true (but also see this), but the fact that third-party candidates underperform relative to polling is a known effect. However, the magnitude was certainly hard to predict because a third party candidate has not polled so well in recent elections. It doesn't really make sense to assume Johnson voters would vote for Clinton either. When Johnson ran for governor of New Mexico as a Republican, he was the Libertarian outsider, much like Trump was an outsider getting the Republican nomination for president. Certainly, Johnson's views are closer to the conservative agenda than the liberal one.

Turnout affected by election predictions
There has been reporting on how Clinton has gotten the third highest vote total ever of any presidential candidate (after Obama 2008 and Obama 2012). This is a weird metric to judge her on considering turnout decreased compared to 2012 and Clinton got a much smaller percentage of the vote than Obama did in 2012. Ultimately, the statement is just saying that the voting pool has increased, not any deep statement about how successful Clinton is. In particular, let's focus on the 48% of the vote that Clinton got. I have to imagine that if there is a candidate with a low approval and there are claims she has a 98% chance of winning the election, that a lot of people just aren't going to be excited to go vote for her. I could see this manifesting as low turnout and increased third-party support. Stein did do about three times better than she did in 2012 (as did Johnson). In addition, people who really dislike the candidate (and there are a lot of them since the candidate has a low approval rating) are encouraged to show up for the election. I don't see obvious evidence of this, but I have to imagine there was an incentive to go vote against Clinton. This could explain the slight underperformance relative to polls in the aforementioned states as well as the large Clinton underperformance in Wisconsin. There's been talk of fake news affecting the election results but I think the real news predicting the near-certain election of Clinton had just as much to do with it.

Would Clinton have won if the election were decided by popular vote?
This is a very difficult question to answer. The presidential candidates campaign assuming the electoral college system so clearly the election would be different if it were decided by popular vote. Certainly, this seems quite efficient for Democrats. Democratic candidates can campaign in large cities and encourage turnout there, whereas Republican candidates would have to spread themselves thinner to reach their voter base. One thing I haven't seen discussed very much is that this would probably decrease the number of third-party voters. In a winner-takes-all electoral college system, any vote that gives the leading candidate a larger lead is wasted. So, in states like California, where Clinton was projected to have a 23 point advantage over Trump, a rational voter should feel free to vote for a third party since this has no effect on the outcome. In a national popular vote, there are no wasted votes and a rational voter should vote for the candidate that they would actually like to see be president (of course people don't always act rationally). As argued before, Johnson's voters seem to generally prefer Trump over Clinton, so the number of these people that would change their vote under a popular vote election is definitely a relevant factor in deciding whether a national popular vote election would actually have preferred Clinton. Stein's voters would generally prefer Clinton over Trump, but there were fewer of these voters to affect the results.

Electoral college reform needs to happen
Yes, but if it didn't happen after the 2000 election, I think it's unlikely to happen now. The most likely proposal that I have been able to come up with (with the disclaimer that I have very little political know-how and am strictly thinking of this as a mathematical problem) is to increase the number of house members. This is only a change to federal law, and thus would not be as hard to change as the whole electoral college system, which would take a constitutional amendment. If states had a proportional appointment of electors, then as the number of house members increases, the electoral college system approaches a national popular vote election. This is complicated by the winner-takes-all elector system most states have. For example, the total population of the states (and district) Clinton won seems to be 43.7% of the total U.S. population, so even though she won the popular vote, with winner-takes-all systems in place, it is difficult to imagine a simple change to the electoral college system that is closer to a popular vote.