This work was done by the fantastic team I mentored during the CDIPS data science workshop.
I got the idea for this project after reading Subliminal by Leonard Mlodinow. That book cited research suggesting that when people are asked to rate pictures of people based on competency, the average competency score of a candidate is predictive of whether the candidate will win or not. The predictions are correct about 70% of the time for senators and 60% for house members, so while not a reliable indicator, there seems to be some correlation between appearance and winning Senate races. So, we decided to use machine learning to build a model to assess senator faces in the same manner.
This is a challenge as there are only around 30 Senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old since as trends in hair and fashion change, these probably affect how people perceive competence of people. We ended up using elections from 2007-2014 as training data to predict on the 2016 election. We got the senator images from Wikipedia and Google image search. We used images for the top two finishers, which is usually a democrat and republican. We didn't include other elections since the images were less readily available and we aren't sure if appearing senatorial is the same as appearing presidential (more on that later).
We use a neural network to learn the relationships between pictures and likelihood of winning elections. As input, we provide a senator image with the label of whether the candidate won their race or not. Note that this means that our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the election (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To assess the winner of a particular election, we compare the probability of the two candidates and assume the candidate with the higher probability will win.
To cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers followed by max-pooling layers. After the convolutional layers, we used a fully connected layer before outputting the election win probabilities. Even with these simple networks, there are millions of parameters that must be constrained, which will result in overfitting with the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images to increase the number of training images to make the training more robust. We also explored transfer learning, where we train the model using a similar problem with more data, and use that as a base network to train the senator model on.
We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on senator images (with the image modifications).
(II) Transfer on senator images from male/female classifier trained on faces in the wild.
(III) Transfer on face images from vgg face (this is a much deeper network than the first two).
We compare and contrast each of these approaches to the problem.
The accuracy in predicting the winners in each state in 2016 for each model were respectively (I) 82%, (II) 74%, and (III) 85%. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of Senate races in 2016 is small, so these results come with a lot of uncertainty. Further, the training and validation sets are not independent. Many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.
The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.
Scores predicted by each of the models for different pictures of the same candidate. Each row is the prediction of each of the three models.
All of our models have some variability in predictions on pictures of the same candidate so our model may benefit from learning on more varied pictures of candidates. We have to be careful, though, as lesser-known candidates will have fewer pictures and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest score among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos and the model may be catching this trait.
We also looked at what features the model learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the trustworthiness of a candidate.
Scores predicted by each of the models for pictures where part of the image is masked. Each row is the prediction of each of the three models.
We find masking the images wildly changes the prediction for models (I) and (II) but not for model (III). It seems that for this model apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.
We also compared what the first convolutional layer in each of the models learned.
Samples of the first convolutional layer activations after passing in the image on the far left as visualized by the quiver package. Each row shows the output for each of the three models. We see that each model picks up on similar features in the image.
This confirms that apparel is quite significant, with the candidate's tie and suit being picked up by each of the models. The models do also pick up on some of the edges in facial features as well. A close inspection of the layer output shows that the result of model (III) is cleaner than the other two models in picking up these features.
Given all of these findings, we determine the most robust model is model (III), which was the model that predicted the most 2016 elections correctly as well.
Earlier, we mentioned we trained on senator data because we were not sure whether other elections had similar relationships between face and winning. We tested this hypothesis on the last three presidential elections. This is a limited data set, but we find the model predicts only one of the elections correctly. Since presidential elections are so rare, training a model to predict on presidents is a challenge.
Model (III) predictions on presidential candidates.
Our models were trained to give a general probability of winning an election. We ignore the fact that senator elections, for the most part, are head to head. There may be benefits from training models to consider the two candidates running for the election and having the model choose the winner. Ultimately, we would want to combine the feature created here with other election metrics including polls. This would be another significant undertaking to figure out how to reliably aggregate results, but this may offer orthogonal insights to methods that are currently used to predict election results.
Check out the code for the project here.