
Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really, I read almost every article they publish), and I think they do a great job of interpreting poll results. However, I do notice that the process by which they compile the results is tedious. Some of the tediousness, like collecting poll results, is inevitable. Other work, though, like figuring out which polls are reliable and how to interpret polls across different states, seems like it might be automatable. I took a stab at this automation using a dataset that fivethirtyeight compiled, which contains polling data for various state and national elections from the three weeks before each election date. This amounted to about 6,200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but exactly what kind of moving average to take is up for debate.
2) How to decide which polls are good. Bad polls are a problem, and if a pollster is partisan, there needs to be a way to take that into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll when it disagrees with the common wisdom, this can introduce bias. There is also uncertainty about how undecided voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue is tricky and will not be covered here, but the first three are discussed in this post.

I tackle this as a time series prediction problem: given a time series (when the polls happen and their results), I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to the problem. RNNs even handle sequences of different lengths quite naturally, which is a plus, as variable-length input is awkward to encode for other types of models, such as tree-based models.
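As a quick illustration of that last point, here's a minimal sketch (not the code from this post) of how variable-length poll sequences can be padded to a common length and the padding masked out in keras, so a single RNN can consume races with any number of polls:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

# Hypothetical example: three races with 2, 4, and 3 polls, each poll
# described by a 10-dimensional feature vector.
races = [np.random.rand(2, 10), np.random.rand(4, 10), np.random.rand(3, 10)]

# Pad every sequence to the length of the longest one; padded steps are all zeros.
X = pad_sequences(races, dtype='float32', padding='pre', value=0.0)

model = Sequential([
    Masking(mask_value=0.0, input_shape=(X.shape[1], X.shape[2])),  # ignore padded steps
    LSTM(32),
    Dense(1),
])
model.compile(loss='mse', optimizer='adam')
```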

I use the data prior to 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, rather than whole elections (or the whole presidential race, which is just as complex). For each poll, I compile a set of information about it:
• How each candidate in the race polled, ordered as Democrat, Republican, and the highest-polling third-party candidate, if available. The fixed order is relevant, as I want the model to account for the political leanings of polls.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known.
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• The pollster's percentile rank, by the number of polls it had conducted before this one, relative to all other pollsters. The intuition here is that agencies that have done only one poll are not very reliable, whereas agencies that have done many polls probably are.

All of this information is collected into a single vector, which I will call the poll information. I then take all polls of the same race prior to that poll and make an ordered list of their poll-information vectors; this sequence is the input to the neural network model.
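For concreteness, here is a rough sketch of how such poll-information vectors and per-race sequences might be assembled. The field names and exact column order are my own illustration rather than the encoding in the actual code, but the normalizations follow the footnotes (days divided by 365, sample size in thousands squashed with a tanh), and I include the poll itself at the end of its own sequence:

```python
import numpy as np

def poll_features(poll):
    """Encode one poll (a dict with hypothetical field names) as a flat vector."""
    return np.array([
        poll['dem_pct'],                               # Democratic candidate's share
        poll['rep_pct'],                               # Republican candidate's share
        poll.get('third_party_pct', 0.0),              # top third-party share, if any
        poll['days_to_election'] / 365.0,              # footnote 1: normalize by a year
        np.tanh(poll.get('sample_size', 0) / 1000.0),  # footnote 2: thousands, squashed to (0, 1)
        poll.get('partisan_lean', 0.0),                # e.g. -1 Dem lean, 0 none/unknown, +1 Rep lean
        float(poll.get('live_caller', False)),         # defaults to False if unknown
        float(poll.get('internet', False)),            # defaults to False if unknown
        float(poll.get('ncpp_aapor_roper', False)),    # professional-organization membership
        poll['pollster_experience_pct'],               # percentile of polls conducted so far
    ], dtype='float32')

def race_sequences(polls_sorted_by_date):
    """For each poll, build the sequence of that poll and all earlier polls of the race."""
    feats = [poll_features(p) for p in polls_sorted_by_date]
    return [np.stack(feats[:i + 1]) for i in range(len(feats))]
```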

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any notion of "year" as an input, since I want the model to extrapolate on this variable and hence do not want it to overfit to any trends there may be in it. I use three LSTM layers with 256 units each, followed by two fully connected layers with 512 neurons each. The output layer is a single neuron that predicts the final margin of the election, and mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initializing by transfer learning from an exponentially weighted moving average.
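In keras, a model along these lines looks roughly like the sketch below. Only the layer sizes and the loss are stated above; the activations, optimizer, and the placement of the dropout layers (at the 40% rate calibrated in the next section) are illustrative guesses:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

n_features = 10     # length of the poll-information vector
dropout_rate = 0.4  # the prediction-time rate calibrated below

model = Sequential([
    LSTM(256, return_sequences=True, input_shape=(None, n_features)),
    Dropout(dropout_rate),
    LSTM(256, return_sequences=True),
    Dropout(dropout_rate),
    LSTM(256),                      # final LSTM layer returns only its last output
    Dropout(dropout_rate),
    Dense(512, activation='relu'),
    Dropout(dropout_rate),
    Dense(512, activation='relu'),
    Dropout(dropout_rate),
    Dense(1),                       # predicted final margin of the race
])
model.compile(loss='mse', optimizer='adam')
```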

I use dropout at prediction time as a way to estimate the error in the output of the model. The range containing 90% of the predictions made with different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on the training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 elections). For each race, I find the percentile of the true election result within the Monte Carlo model predictions⁴. A perfectly calibrated model would give percentiles that are uniformly distributed. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic against the uniform distribution. This turned out to be 40%, which was comforting, as this is a typical choice for dropout during training.
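The Monte Carlo dropout procedure and the calibration check look something like the sketch below, where `validation_races` is a hypothetical iterable of padded input sequences and true margins. With the keras/TensorFlow versions of this era, a backend function with the learning phase set to 1 keeps dropout active at prediction time; in newer tf.keras one would instead call the model with training=True:

```python
import numpy as np
import keras.backend as K
from scipy.stats import kstest

# Build a function that runs the network in "training" mode so dropout stays on.
mc_predict = K.function([model.input, K.learning_phase()], [model.output])

def mc_samples(x, n=1000):
    """Draw n stochastic margin predictions for one sequence x of shape (1, timesteps, features)."""
    return np.array([mc_predict([x, 1])[0].ravel()[0] for _ in range(n)])

# Calibration on the validation set: where does the true margin fall among the samples?
percentiles = []
for x_val, true_margin in validation_races:   # hypothetical iterable of (sequence, outcome)
    samples = mc_samples(x_val)
    percentiles.append(np.mean(samples < true_margin))

# Compare the percentiles to a uniform distribution; a smaller statistic means
# better calibration. Repeat for several dropout rates and keep the best one.
ks_stat, _ = kstest(percentiles, 'uniform')
```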

[Figure] A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data prior to 2016 (thus using both the training and validation sets). Using the calibrated dropout rate and the newly trained model, I again draw Monte Carlo samples for polls in the test set. I then simply count the fraction of Monte Carlo outputs with a positive margin to get the probability that the election swings in favor of the first party.
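Continuing the sketch above, turning the Monte Carlo samples into a win probability and a confidence interval is just counting (`x_test` here is a hypothetical test-set sequence):

```python
samples = mc_samples(x_test)                  # retrained model, calibrated 40% dropout
p_first_party = np.mean(samples > 0)          # fraction of positive predicted margins
low, high = np.percentile(samples, [5, 95])   # 90% confidence interval on the margin
```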

Applying this to a race, I find that the uncertainty in the ultimate election margin is pretty large, around 15%. This isn't any larger than for other aggregation methods, but it shows that it really is tough to accurately call close races based on polls alone.

[Figure] Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available on a given day, I run 1000 predictions with different RNG seeds for each state. For each of these 1000 runs, I add up the electoral votes the candidate would win given the predicted margins. The probability of the candidate winning is then the fraction of these outcomes with at least 270 electoral votes.
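A sketch of that tally, assuming a dictionary of per-state Monte Carlo margin samples (positive meaning the candidate carries the state) and each state's electoral votes:

```python
import numpy as np

def win_probability(state_margins, electoral_votes, n_samples=1000):
    """Fraction of simulated elections in which the candidate reaches 270 electoral votes.
    (269-269 ties are counted as losses here for simplicity.)"""
    totals = np.zeros(n_samples)
    for state, margins in state_margins.items():
        totals += np.where(margins[:n_samples] > 0, electoral_votes[state], 0)
    return np.mean(totals >= 270)
```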

[Animation] Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republican victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.

[Figure] The probability of Clinton winning the 2016 election predicted by the model, as a function of days before the election.

Unless the 2016 election truly was a rare event, this clearly shows that the model is incomplete. Relationships between how states vote relative to their polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively, and in particular to better discern which pollsters are reliable. More features, such as whether the incumbent president is running or whether it is an off-year election, may also add information to the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.

 

1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or larger than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the main contributor to the KS test statistic, so it is ignored here.

 

Predicting Elections from Pictures

This work was done by the amazing team I mentored during the CDIPS data science workshop.

I got the idea for this project after reading Subliminal by Leonard Mlodinow. That book cites research suggesting that when people are asked to rate pictures of candidates on competence, the average competence score of a candidate is predictive of whether the candidate will win. The predictions are correct about 70% of the time for senators and 60% of the time for House members, so while not a strong indicator, there seems to be some correlation between appearance and winning Senate races. So we decided to use machine learning to build a model that assesses senators' faces in the same manner.

This is a challenge, as there are only around 30 Senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old, since trends in hair and fashion change over time and probably affect how people perceive competence. We ended up using elections from 2007-2014 as training data to predict the 2016 election. We got the senator images from Wikipedia and Google image search, collecting images for the top two finishers in each race, which usually means a Democrat and a Republican. We didn't include other kinds of elections, since the images were less readily available and we weren't sure whether looking senatorial is the same as looking presidential (more on that later).

Interpreting Faces
We use a neural network to learn the relationship between pictures and the likelihood of winning an election. As input, we provide a senator's image along with a label indicating whether the candidate won their race. Note that this means our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the race (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To call the winner of a particular election, we compare the probabilities of the two candidates and assume the candidate with the higher probability will win.
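That head-to-head comparison is simple once each candidate has a score; a sketch, assuming a trained model that maps one preprocessed image to a single win probability:

```python
import numpy as np

def predict_race_winner(model, image_a, image_b):
    """Pick the candidate whose image gets the higher predicted win probability."""
    p_a = model.predict(image_a[np.newaxis])[0, 0]
    p_b = model.predict(image_b[np.newaxis])[0, 0]
    return ('candidate A', p_a) if p_a >= p_b else ('candidate B', p_b)
```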

To cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers, each followed by a max-pooling layer. After the convolutional layers, we used a fully connected layer before outputting the election win probability. Even with these simple networks, there are millions of parameters to constrain, which leads to overfitting given the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images in order to increase the number of training images and make the training more robust. We also explored transfer learning, where we first train a model on a similar problem with more data and then use it as a base network for training the senator model.
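Here is a sketch of a network and augmentation pipeline in this spirit; the filter counts, image size, and binary cross-entropy loss are illustrative guesses rather than our exact configuration, and keras' built-in generator only covers the geometric transformations (blur and noise would need a custom preprocessing_function):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.preprocessing.image import ImageDataGenerator

# A few conv/max-pool blocks, one fully connected layer, and a sigmoid win probability.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),   # probability the pictured candidate wins
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Augment the limited training set with small rotations and translations.
augmenter = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
```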

We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).

Model Results
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on the senator images (with the image modifications).
(II) Transfer learning onto the senator images from a male/female classifier trained on faces in the wild.
(III) Transfer learning onto the senator images from vgg face, a much deeper network than the first two (a sketch of this transfer-learning setup follows below).
We compare and contrast each of these approaches to the problem.
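The transfer-learning setup for approaches (II) and (III) looks roughly like the following. Here the ImageNet-trained VGG16 from keras.applications stands in for the pretrained base purely for illustration (the actual VGG Face weights have to be obtained separately), and the head sizes are guesses:

```python
from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout

# Load a pretrained convolutional base and freeze it so only the new head
# trains on the small senator data set.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(1, activation='sigmoid')(x)   # win probability for the pictured candidate

transfer_model = Model(inputs=base.input, outputs=out)
transfer_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```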

The accuracy in predicting the 2016 winners in each state was, respectively, (I) 82%, (II) 74%, and (III) 85%. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of Senate races in 2016 is small, so the results come with a lot of uncertainty. In addition, the training and validation sets are not independent: many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.

[Figure] The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.

Validating Results
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.

[Figure] Scores predicted by each of the models for different pictures of the same candidate. Each row shows the predictions of one of the three models.

All of our models show some variability in their predictions on different pictures of the same candidate, so our models may benefit from training on more varied pictures of each candidate. We have to be careful, though, as lesser-known candidates will have fewer pictures, and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest scores among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos, and the model may be catching this trait.

We also looked at what features the models learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the perceived trustworthiness of a candidate.
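A generic occlusion sketch of this kind of check (not necessarily the exact masks we used): slide a gray patch across the image and record how the predicted win probability moves, assuming images normalized to [0, 1]:

```python
import numpy as np

def occlusion_map(model, image, patch=32, stride=16):
    """Slide a gray square over the image and record how the predicted win
    probability changes; large changes mark regions the model relies on."""
    h, w, _ = image.shape
    baseline = model.predict(image[np.newaxis])[0, 0]
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            masked = image.copy()
            masked[y:y + patch, x:x + patch, :] = 0.5   # cover a patch with gray
            heatmap[i, j] = model.predict(masked[np.newaxis])[0, 0] - baseline
    return heatmap
```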

[Figure] Scores predicted by each of the models for pictures where part of the image is masked. Each row shows the predictions of one of the three models.

We find that masking the images wildly changes the predictions for models (I) and (II) but not for model (III). It seems that for this model, apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.

We also compared what the first convolutional layer in each of the models learned.
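We browsed these activations interactively with the quiver package; the manual equivalent in keras is to build a sub-model that stops at the first convolutional layer and plot a few of its feature maps (here `model` and `image` stand for any one of the trained models and a preprocessed candidate image):

```python
import matplotlib.pyplot as plt
from keras.models import Model

# Sub-model that outputs the activations of the first convolutional layer.
first_conv = next(layer for layer in model.layers if 'conv' in layer.name)
activation_model = Model(inputs=model.input, outputs=first_conv.output)
maps = activation_model.predict(image[np.newaxis])[0]   # shape (h, w, n_filters)

# Show the first few feature maps side by side.
fig, axes = plt.subplots(1, 6, figsize=(15, 3))
for k, ax in enumerate(axes):
    ax.imshow(maps[:, :, k], cmap='gray')
    ax.axis('off')
plt.show()
```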

[Figure] Samples of the first convolutional layer activations after passing in the image on the far left, as visualized with the quiver package. Each row shows the output for one of the three models. We see that each model picks up on similar features in the image.

This confirms that apparel is quite important, with the candidate's tie and suit being picked up by each of the models. The models also pick up on some of the edges of facial features. A close inspection of the layer outputs shows that model (III) is cleaner than the other two in picking up these features.

Given all of these findings, we judge model (III) to be the most robust; it was also the model that predicted the most 2016 races correctly.

Conclusion
Earlier, we mentioned that we trained on Senate data because we were not sure whether other elections have a similar relationship between faces and winning. We tested this on the last three presidential elections. This is clearly an extremely limited data set, but the model predicts only one of the three elections correctly. Since presidential elections are so rare, training a model to predict them is a challenge.

[Figure] Model (III) predictions on presidential candidates.

Our models were trained to give a general probability of winning an election, ignoring the fact that Senate elections are, for the most part, head-to-head. There may be a benefit to training models that consider both candidates in a race together and choose the winner. Ultimately, we would want to combine the features created here with other election metrics, including polls. Figuring out how to reliably combine the results would be another large undertaking, but it may offer insights orthogonal to the methods currently used to predict election results.

Check out the code for the project here.