Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really, I read almost every article they write), and I think they do a great job of interpreting poll results. However, I do notice that the process by which they compile the results is tedious. Some of the tedium, like collecting poll results, is inevitable. Other work, including figuring out which polls are reliable and how to interpret polls across different states, seems like it might be automatable. I took a stab at this automation using data compiled by fivethirtyeight, which gathers polling data from the three weeks before the election date for various state and national elections. This came to about 6200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but what kind of moving average to use is up for debate.
2) How to decide which polls are good. Dishonest polls are a problem. If a pollster is partisan, there needs to be a way to take that into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll if it disagrees with the conventional wisdom, this can introduce bias. There is also uncertainty about how undecided voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue was tricky, and will not be covered here, but the first three issues are discussed in this post.
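
For the first item, a natural baseline is an exponentially weighted moving average of poll margins. The sketch below shows only that baseline, not the model used in this post; the function name, the half-life parameter, and the example numbers are assumptions for illustration.

```python
def ewma_margin(polls, half_life_days=7.0):
    """Weighted average of poll margins that favors recent polls.

    `polls` is a list of (days_before_election, margin) pairs. A poll that is
    `half_life_days` older than the most recent poll gets half its weight.
    """
    latest = min(days for days, _ in polls)
    weights = [0.5 ** ((days - latest) / half_life_days) for days, _ in polls]
    weighted_sum = sum(w * margin for w, (_, margin) in zip(weights, polls))
    return weighted_sum / sum(weights)


# Made-up margins (Democrat minus Republican, in points).
average = ewma_margin([(0, 4.0), (7, 2.0)])
```

The half-life is exactly the kind of choice that is "up for debate": a short one chases noise, a long one reacts slowly to real movement.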

I tackle the problem as a time series prediction problem. That is, given a time series (when the polls happen and their results) I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to solve the problem. RNNs even handle different-length sequences quite naturally, which is a plus as this is awkward to encode into useful input for other types of models like a tree-based model.

I use the data before 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, instead of whole elections (or the entire presidential race, which is equally complex). For each poll, I compile a set of information about the polls:
• How each candidate in the race polled, in the order Democrat, Republican, and highest-polling third-party candidate, if available. The ordering is relevant as I account for the political preferences of polls.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known.
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• The pollster's percentile rank, relative to all other pollsters, in the number of polls conducted before this poll. The intuition here is that agencies that have done only one poll are not very reliable, whereas agencies that have done many polls probably are.
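
A minimal sketch of how one poll might be turned into such a vector. The dictionary keys, the order of the entries, the scaling of vote shares by 100, and the encoding of partisan leaning as -1/0/+1 are all assumptions for illustration; the days and sample-size normalizations follow the footnotes (divide by 365, tanh of the size in thousands, 0 when missing).

```python
import math


def poll_to_vector(poll, pollster_percentile):
    """Encode one poll as a flat list of floats.

    `poll` is a dict with hypothetical keys; missing values fall back to the
    defaults described in the list above.
    """
    sample_size = poll.get("sample_size")  # respondents, may be missing
    size_feature = math.tanh(sample_size / 1000.0) if sample_size else 0.0
    # Assumed encoding: Democratic-leaning -1, Republican-leaning +1, else 0.
    leaning = {"D": -1.0, "R": 1.0}.get(poll.get("partisan"), 0.0)
    return [
        poll["dem_pct"] / 100.0,
        poll["rep_pct"] / 100.0,
        poll.get("third_pct", 0.0) / 100.0,
        poll["days_before"] / 365.0,
        size_feature,
        leaning,
        1.0 if poll.get("live_caller") else 0.0,
        1.0 if poll.get("internet") else 0.0,
        1.0 if poll.get("ncpp_aapor_roper") else 0.0,
        pollster_percentile,
    ]


# Made-up poll for illustration.
example = poll_to_vector(
    {"dem_pct": 47.0, "rep_pct": 44.0, "days_before": 14,
     "sample_size": 800, "live_caller": True},
    pollster_percentile=0.9,
)
```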

All of this information is collected into a single vector, which I will call the poll information. For each poll, I then take all earlier polls of the same race and make an ordered list of their poll information vectors; this sequence is the input to the neural network model.
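
The sequence construction could look roughly like the sketch below. It assumes each sequence also includes the poll itself as its last element (otherwise the first poll of a race would have an empty input); that choice and the function name are assumptions, not something the post specifies.

```python
def build_sequences(race_polls):
    """Return one input sequence per poll in a single race.

    `race_polls` is a list of (days_before_election, vector) pairs. Each
    sequence holds the vectors of every poll conducted no later than the
    poll in question, ordered oldest first.
    """
    # More days before the election means an older poll, so sort descending.
    ordered = sorted(race_polls, key=lambda p: p[0], reverse=True)
    sequences = []
    for i in range(len(ordered)):
        sequences.append([vec for _, vec in ordered[: i + 1]])
    return sequences


# Made-up two-feature vectors for three polls of one race.
race = [(3, [0.48, 0.45]), (20, [0.46, 0.47]), (10, [0.47, 0.46])]
sequences = build_sequences(race)
```

The sequences have different lengths (here 1, 2, and 3), which is the property that makes a recurrent model a convenient fit.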

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any sense of "year" as input to the neural network as I wish the model to extrapolate on this variable and hence I do not want the model to overfit to any trends there may be in this variable. I use three LSTM layers with 256 units followed by two fully connected layers with 512 neurons. The output layer is one neuron that predicts the final margin of the election. Mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initialize by transfer learning from an exponentially weighted moving average.

I use dropout at prediction time to estimate the error in the model's output. The range containing 90% of the predictions made with different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on the training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 election). For each race, I find the percentile of the ground truth election result within the Monte Carlo model predictions⁴. A perfectly calibrated model would produce a uniform distribution of these percentiles. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic against the uniform distribution. This turned out to be 40%, which was comforting as this is a typical choice for dropout during training.
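
The calibration step can be sketched without any deep learning library, assuming the Monte Carlo predictions have already been generated for each candidate dropout rate. The function names and the example percentiles are made up for illustration; the KS statistic is the standard one-sample statistic against Uniform(0, 1).

```python
from bisect import bisect_left


def truth_percentile(mc_predictions, truth):
    """Fraction of Monte Carlo predictions that fall below the true margin."""
    ordered = sorted(mc_predictions)
    return bisect_left(ordered, truth) / len(ordered)


def ks_statistic_uniform(percentiles):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1)."""
    xs = sorted(percentiles)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Largest gap between the empirical CDF and the uniform CDF at x.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


def pick_dropout(percentiles_by_rate):
    """Choose the dropout rate whose percentiles look most uniform."""
    return min(percentiles_by_rate,
               key=lambda rate: ks_statistic_uniform(percentiles_by_rate[rate]))


# Made-up percentiles: the 0.2 model is overconfident (truth always in the
# upper tail), the 0.4 model spreads the truth evenly.
best = pick_dropout({
    0.2: [0.9, 0.95, 1.0],
    0.4: [0.125, 0.375, 0.625, 0.875],
})
```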


A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data before 2016 (thus using both the training and validation sets). Using the calibrated dropout rate and the newly trained model, I again draw Monte Carlo samples for polls in the test set. The fraction of Monte Carlo outputs with a positive margin then gives the probability that the election swings in favor of the first party.
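
Given a list of Monte Carlo margins for one race, the interval and the win probability could be summarized as in this sketch. The nearest-rank percentile rule and the function name are assumptions; the input here is a made-up range of margins rather than real model output.

```python
def summarize_samples(margins):
    """Return (low, high, win_probability) for Monte Carlo margin samples.

    `low` and `high` bound the central 90% of the samples using a simple
    nearest-rank rule; the win probability is the fraction of samples with a
    positive margin.
    """
    ordered = sorted(margins)
    n = len(ordered)
    low = ordered[n * 5 // 100]
    high = ordered[min(n - 1, n * 95 // 100)]
    win_probability = sum(1 for m in margins if m > 0) / n
    return low, high, win_probability


# Made-up samples: 100 margins from -20 to 79 points.
low, high, win_probability = summarize_samples(list(range(-20, 80)))
```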

Applying this to a Senate race, I find that the uncertainty in election margin is sizable, around 15%. This is comparable to uncertainties obtained through other aggregation methods, but this shows that it is tough to accurately call close races just based on polls.


Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available on a given day, I run 1000 predictions with different RNG seeds for each state. For each of these 1000 predictions, I add up the electoral votes the candidate would win given the predicted margins. The probability of the candidate winning is then the percentage of these outcomes in which the candidate reaches at least 270 electoral votes.

Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republican victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.


The probability of Clinton winning the 2016 election predicted by the model as a function of days before the election.

Unless the 2016 election was a rare event, this clearly shows that the model is incomplete. Relationships between how states vote compared to polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively and, in particular, to better discern which pollsters are reliable. More features, such as whether the incumbent president is running or whether it is an off-year election, may also add information to the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.


1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or more than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the primary contributor to the KS test statistic, so it is ignored here.


Predicting Elections from Pictures

This work was done by the fantastic team I mentored during the CDIPS data science workshop.

I got the idea for this project after reading Subliminal by Leonard Mlodinow. That book cited research suggesting that when people are asked to rate pictures of candidates on competency, a candidate's average competency score is predictive of whether the candidate will win. The predictions are correct about 70% of the time for senators and 60% for House members, so while not a reliable indicator, there seems to be some correlation between appearance and winning Senate races. So, we decided to use machine learning to build a model to assess senator faces in the same manner.

This is a challenge as there are only around 30 Senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old, since trends in hair and fashion change and probably affect how people perceive competence. We ended up using elections from 2007-2014 as training data to predict the 2016 election. We got the senator images from Wikipedia and Google image search. We used images for the top two finishers, who are usually a Democrat and a Republican. We didn't include other elections since the images were less readily available and we weren't sure if appearing senatorial is the same as appearing presidential (more on that later).

Interpreting Faces
We use a neural network to learn the relationships between pictures and likelihood of winning elections. As input, we provide a senator image with the label of whether the candidate won their race or not. Note that this means that our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the election (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To assess the winner of a particular election, we compare the probability of the two candidates and assume the candidate with the higher probability will win.
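
The head-to-head comparison amounts to the sketch below. The candidate names, scores, and function names are hypothetical; note that the two scores in a race need not sum to one, since each image is scored independently.

```python
def pick_winner(scores):
    """Return the candidate with the highest model score in one race.

    `scores` maps candidate name to the model's win probability.
    """
    return max(scores, key=scores.get)


def accuracy(races, actual_winners):
    """Fraction of races where the higher-scored candidate actually won."""
    correct = sum(1 for race, scores in races.items()
                  if pick_winner(scores) == actual_winners[race])
    return correct / len(races)


# Made-up scores for two hypothetical races.
races = {
    "race_1": {"candidate_a": 0.7, "candidate_b": 0.4},
    "race_2": {"candidate_c": 0.2, "candidate_d": 0.3},
}
actual = {"race_1": "candidate_a", "race_2": "candidate_c"}
race_accuracy = accuracy(races, actual)
```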

To cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers followed by max-pooling layers. After the convolutional layers, we used a fully connected layer before outputting the election win probabilities. Even with these simple networks, there are millions of parameters that must be constrained, which will result in overfitting with the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images to increase the number of training images to make the training more robust. We also explored transfer learning, where we train the model using a similar problem with more data, and use that as a base network to train the senator model on.

We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).

Model Results
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on senator images (with the image modifications).
(II) Transfer on senator images from male/female classifier trained on faces in the wild.
(III) Transfer on face images from vgg face (this is a much deeper network than the first two).
We compare and contrast each of these approaches to the problem.

The accuracies in predicting the winners in each state in 2016 were (I) 82%, (II) 74%, and (III) 85%, respectively. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of Senate races in 2016 is small, so these results come with a lot of uncertainty. Further, the training and validation sets are not independent. Many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.

The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.

Validating Results
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.

Scores predicted by each of the models for different pictures of the same candidate. Each row is the prediction of each of the three models.

All of our models have some variability in predictions on pictures of the same candidate so our model may benefit from learning on more varied pictures of candidates. We have to be careful, though, as lesser-known candidates will have fewer pictures and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest score among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos and the model may be catching this trait.

We also looked at what features the model learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the trustworthiness of a candidate.

Scores predicted by each of the models for pictures where part of the image is masked. Each row is the prediction of each of the three models.

We find masking the images wildly changes the prediction for models (I) and (II) but not for model (III). It seems that for this model apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.
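
A masking experiment of this kind can be sketched as below. This is a generic occlusion test, not the project's actual code: the image is a plain 2D list of pixel intensities, and `score_fn` stands in for a trained model's scoring call; both, along with the patch size, are assumptions for illustration.

```python
def occlusion_sensitivity(image, score_fn, patch=2, fill=0.0):
    """Score change caused by masking each patch of the image.

    Returns a dict mapping the (row, col) of each patch's top-left corner to
    the masked score minus the unmasked score.
    """
    base = score_fn(image)
    n_rows, n_cols = len(image), len(image[0])
    changes = {}
    for r in range(0, n_rows, patch):
        for c in range(0, n_cols, patch):
            masked = [row[:] for row in image]  # copy so the input is untouched
            for rr in range(r, min(r + patch, n_rows)):
                for cc in range(c, min(c + patch, n_cols)):
                    masked[rr][cc] = fill
            changes[(r, c)] = score_fn(masked) - base
    return changes


# Toy stand-in for a model that only looks at the bottom row of the image
# (think of it as the region where the tie would be).
toy_image = [[1.0] * 4 for _ in range(4)]
toy_score = lambda img: sum(img[3]) / 4.0
changes = occlusion_sensitivity(toy_image, toy_score)
```

Patches with the largest score change mark the regions the model depends on most.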

We also compared what the first convolutional layer in each of the models learned.

Samples of the first convolutional layer activations after passing in the image on the far left as visualized by the quiver package. Each row shows the output for each of the three models. We see that each model picks up on similar features in the image.

This confirms that apparel is quite significant, with the candidate's tie and suit being picked up by each of the models. The models also pick up on some of the edges in facial features. A close inspection of the layer output shows that model (III) picks up these features more cleanly than the other two models.

Given all of these findings, we determine the most robust model is model (III), which was the model that predicted the most 2016 elections correctly as well.

Earlier, we mentioned we trained on senator data because we were not sure whether other elections had similar relationships between face and winning. We tested this hypothesis on the last three presidential elections. This is a limited data set, but we find the model predicts only one of the elections correctly. Since presidential elections are so rare, training a model to predict on presidents is a challenge.

Model (III) predictions on presidential candidates.

Our models were trained to give a general probability of winning an election. We ignore the fact that senator elections, for the most part, are head to head. There may be benefits from training models to consider the two candidates running for the election and having the model choose the winner. Ultimately, we would want to combine the feature created here with other election metrics including polls. This would be another significant undertaking to figure out how to reliably aggregate results, but this may offer orthogonal insights to methods that are currently used to predict election results.

Check out the code for the project here.

2017 Reading List

I've been reading a lot during my commutes this year, and I thought I'd summarize some of my thoughts about the books and blogs I've read and how much I enjoyed them.

Subliminal: How Your Unconscious Mind Rules Your Behavior by Leonard Mlodinow - 4/5
This was a fun read. Lots of examples of how unconscious decisions are actually more prevalent than most people realize. This was particularly interesting as I've been thinking more about unconscious bias lately. This book got me thinking about data-driven approaches to quantify bias (both conscious and unconscious), but it is tricky to define the correct loss function to train this model.

But What If We're Wrong? by Chuck Klosterman - 2/5
Not sure if I got the point of this book (to be fair, the book did warn me that this might happen at the beginning). I thought the book was about politics (the back of the book mentions the president), but what I remember of it was mostly about pop culture and philosophy. Still, it's good to be reminded periodically to try to think how others feel in a two-sided political situation.

Dune by Frank Herbert - 1/5
Couldn't really get into this book. I thought the world was not particularly exciting and apart from a few events early on, I found the story a bit slow.

Data for the People by Andreas Weigend - 5/5
Full disclosure, Andreas is a friend of mine, so it's hard for me to be objective about this book. I think Andreas discusses provocative ideas on the trade that users have with a company, that is giving up some of their privacy in exchange for better services from the company. I worry though that the ideas are hard to implement without an external body to enforce it. There are many fun stories about data to tie all of these ideas together.

Hillbilly Elegy by J.D. Vance - 3/5
As the author of this book is a successful lawyer who grew up relatively poor in the rust belt, the author is the ideal person to translate the ideals of the working class to city dwellers. This book is mostly about the life of the author but touches on religion, family values, and the hillbilly lifestyle that contributed to his worldview. I felt like this definitely helps contextualize some of the voting patterns that I see in the U.S.

Weapons of Math Destruction by Cathy O'Neil - 4/5
Definitely some useful lessons on how defining metrics is vital and how, without retraining, models can get gamed and not serve their intended purpose. I feel like there were a lot of ideas and guidelines for preventing dangerous models, but without an enforcement mechanism, I'm unsure if any of the ideas can come to fruition. I'm always advocating for data-driven solutions to problems, but this book has made me consider that models have to be cautious, especially when biases can be involved.

Wait but Why Blog by Tim Urban - 4/5
This blog offers in-depth analyses of tech ideas like AI and Elon Musk's companies. The author goes deeply into the details about each topic to provide a comprehensive view of the topic. The only downside is that sometimes I feel like all the criticisms of the topic are not adequately addressed.

Election Thoughts

This election was anomalous in many ways. The approval ratings of both candidates were historically low. Perhaps related, third-party candidates were garnering much more support than usual. The nationwide polling of Gary Johnson was close to 5% and Evan McMullin was polling close to 30% in Utah. There's never really been a candidate without a political history who has gotten the presidential nomination of a major party and there's never been a female candidate who has gotten the presidential nomination of a major party.

These anomalies certainly make statistical predictions more difficult. We'd expect that a candidate might perform similarly to past candidates with similar approval, similar ideologies, or similar polling trends, but there were no similar candidates. We have to assume that the trends that carried over in past, very different elections apply to this one, and presumably, this is why so many of the election prediction models were misguided.

I have a few thoughts I wanted to write out. I am in the process of collecting more data to do a more complete analysis.

Did Gary Johnson ruin the election?
No. In fact, evidence points to Johnson helping Clinton, not hurting her. Looking at the predictions and results in many of the key states (e.g. Pennsylvania, Michigan, Florida, New Hampshire; Wisconsin was a notable exception) Clinton underperformed slightly compared to the expectation, but the far greater effect was that Trump overperformed and Johnson underperformed compared to expectation. This is a pretty good indicator that those who said they'd vote for Johnson ultimately ended up voting for Trump. There seems to be some notion that people were embarrassed to admit they'd vote for Trump in polls. This might be true (but also see this), but the fact that third-party candidates underperform relative to polling is a known effect. However, the magnitude was certainly hard to predict because a third party candidate has not polled so well in recent elections. It doesn't really make sense to assume Johnson voters would vote for Clinton either. When Johnson ran for governor of New Mexico as a Republican, he was the Libertarian outsider, much like Trump was an outsider getting the Republican nomination for president. Certainly, Johnson's views are closer to the conservative agenda than the liberal one.

Turnout affected by election predictions
There has been reporting on how Clinton has gotten the third highest vote total ever of any presidential candidate (after Obama 2008 and Obama 2012). This is a weird metric to judge her on considering turnout decreased compared to 2012 and Clinton got a much smaller percentage of the vote than Obama did in 2012. Ultimately, the statement is just saying that the voting pool has increased, not any deep statement about how successful Clinton is. In particular, let's focus on the 48% of the vote that Clinton got. I have to imagine that if there is a candidate with a low approval and there are claims she has a 98% chance of winning the election, that a lot of people just aren't going to be excited to go vote for her. I could see this manifesting as low turnout and increased third-party support. Stein did do about three times better than she did in 2012 (as did Johnson). In addition, people who really dislike the candidate (and there are a lot of them since the candidate has a low approval rating) are encouraged to show up for the election. I don't see obvious evidence of this, but I have to imagine there was an incentive to go vote against Clinton. This could explain the slight underperformance relative to polls in the aforementioned states as well as the large Clinton underperformance in Wisconsin. There's been talk of fake news affecting the election results but I think the real news predicting the near-certain election of Clinton had just as much to do with it.

Would Clinton have won if the election were decided by popular vote?
This is a very difficult question to answer. The presidential candidates campaign assuming the electoral college system so clearly the election would be different if it were decided by popular vote. Certainly, this seems quite efficient for Democrats. Democratic candidates can campaign in large cities and encourage turnout there, whereas Republican candidates would have to spread themselves thinner to reach their voter base. One thing I haven't seen discussed very much is that this would probably decrease the number of third-party voters. In a winner-takes-all electoral college system, any vote that gives the leading candidate a larger lead is wasted. So, in states like California, where Clinton was projected to have a 23 point advantage over Trump, a rational voter should feel free to vote for a third party since this has no effect on the outcome. In a national popular vote, there are no wasted votes and a rational voter should vote for the candidate that they would actually like to see be president (of course people don't always act rationally). As argued before, Johnson's voters seem to generally prefer Trump over Clinton, so the number of these people that would change their vote under a popular vote election is definitely a relevant factor in deciding whether a national popular vote election would actually have preferred Clinton. Stein's voters would generally prefer Clinton over Trump, but there were fewer of these voters to affect the results.

Electoral college reform needs to happen
Yes, but if it didn't happen after the 2000 election, I think it's unlikely to happen now. The most likely proposal that I have been able to come up with (with the disclaimer that I have very little political know-how and am strictly thinking of this as a mathematical problem) is to increase the number of house members. This is only a change to federal law, and thus would not be as hard to change as the whole electoral college system, which would take a constitutional amendment. If states had a proportional appointment of electors, then as the number of house members increases, the electoral college system approaches a national popular vote election. This is complicated by the winner-takes-all elector system most states have. For example, the total population of the states (and district) Clinton won seems to be 43.7% of the total U.S. population, so even though she won the popular vote, with winner-takes-all systems in place, it is difficult to imagine a simple change to the electoral college system that is closer to a popular vote.
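
The limiting argument for proportional electors can be checked numerically. This is a rough sketch under strong simplifications: House seats are treated as fractional and exactly proportional to population, each state gets two extra electors for its senators, electors are split in proportion to the state's vote share, and turnout is assumed proportional to population. The three states and their numbers are hypothetical.

```python
def electoral_share(states, house_size):
    """Candidate's share of electors under proportional allocation.

    `states` is a list of (population, candidate_vote_share) pairs. Each
    state has 2 electors plus its (fractional) share of `house_size`.
    """
    total_population = sum(pop for pop, _ in states)
    total_electors = house_size + 2 * len(states)
    won = sum((2 + house_size * pop / total_population) * share
              for pop, share in states)
    return won / total_electors


# One large state the candidate wins, two small states the candidate loses.
toy_states = [(10_000_000, 0.60), (1_000_000, 0.40), (1_000_000, 0.40)]
popular_share = sum(pop * share for pop, share in toy_states) / 12_000_000

small_house = electoral_share(toy_states, 12)
huge_house = electoral_share(toy_states, 12_000_000)
```

With a small House, the two senatorial electors overweight the small states, pulling the electoral share below the popular share. As the House grows, those fixed two electors matter less and the two shares converge.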