About Lenny Evans

Posts by Lenny Evans:

Making Sense of Polls - Part 1

I'm a huge fan of fivethirtyeight (really, I read almost every article written), and I think they do a great job of interpreting poll results. However, I do notice that the process by which they compile the results is tedious. Some of the tediousness, like collecting poll results, is inevitable. However, other work, including figuring out which polls are reliable and how to interpret polls across different states, seems like it might be automatable. I took a stab at this automation using data that fivethirtyeight compiled, which contains polling data for various state and national elections from the three weeks before each election date. This data set has about 6,200 polls, which is a relatively small sample to train a machine learning model on.

I've identified some key points that need to be figured out for any election forecast:
1) How to combine poll results (assuming all polls are good). This will be some form of a moving average, but exactly what kind of moving average to take is up for debate.
2) How to decide which polls are good. Bad polls are a problem. If a pollster is partisan, there needs to be a way to take that into account.
3) Estimating the uncertainty in the combined polls. The sampling error is relatively small for most polls, but if pollsters choose not to publish a poll if it disagrees with the common wisdom, this can introduce bias. There is also uncertainty about how uncertain voters and third-party voters will swing.
4) How to determine correlations in the polls. That is, if a candidate performs worse than the polling average would suggest in Pennsylvania, there is likely to be a similar pattern in Wisconsin.

The last issue was tricky, and will not be covered here, but the first three issues are discussed in this post.

I tackle the problem as a time series prediction problem. That is, given a time series (when the polls happen and their results), I want to predict the outcome of the election. This time series can be interpreted as a sequence of events, which means recurrent neural networks (RNNs) are well-suited to the problem. RNNs even handle different-length sequences quite naturally, which is a plus, as variable-length input is awkward to encode for other types of models, like tree-based models.

I use the data prior to 2014 as a training set and 2014 as a validation set. Once I've tuned all the parameters on 2014, I retrain the model with both the training and validation set and predict for 2016 presidential races.

In this post, I tackle the problem of predicting individual races, instead of whole elections (or the whole presidential race, which is equally complex). For each poll, I compile a set of information about the poll:
• How each candidate in the race polled, in order of Democrat, Republican, and highest-polling third-party candidate, if available. The order is relevant as I account for the political preferences of pollsters.
• The number of days before the election that the poll was conducted¹.
• The sample size of the poll².
• The partisan leaning of the polling agency, if known.
• Whether a live caller was used to conduct the poll (default to false if unknown).
• Whether the poll is an Internet poll (default to false if unknown).
• Whether the pollster is a member of the NCPP, AAPOR, or Roper organizations.
• Of polls conducted before the poll, the percentile of the number of polls the pollster has conducted relative to all other pollsters. The intuition here is that agencies that do one poll are not very reliable, whereas agencies that have done many polls probably are.

All of this information is collected into a single vector, which I will call the poll information. I then take all polls of the same race conducted prior to that poll and make an ordered list of these poll-information vectors; this sequence is the input to the neural network model.
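A rough sketch of this preprocessing, assuming the polls arrive as dictionaries (the field names and defaults here are my own stand-ins, not the exact ones in the code):

```python
import numpy as np

def poll_vector(poll):
    """One poll-information vector (hypothetical field names)."""
    return np.array([
        poll["dem_pct"],                                # Democratic candidate's share
        poll["rep_pct"],                                # Republican candidate's share
        poll.get("third_party_pct", 0.0),               # highest-polling third party, if any
        poll["days_to_election"] / 365.0,               # normalized per footnote 1
        np.tanh(poll.get("sample_size", 0) / 1000.0),   # normalized per footnote 2
        poll.get("partisan_lean", 0.0),                 # e.g. -1 Dem, 0 none, +1 Rep
        float(poll.get("live_caller", False)),
        float(poll.get("internet_poll", False)),
        float(poll.get("member_org", False)),           # NCPP / AAPOR / Roper membership
        poll.get("pollster_experience_pctile", 0.0),    # percentile of prior polls conducted
    ])

def race_sequences(polls_sorted_by_date):
    """For each poll, the ordered list of poll vectors up to and including it."""
    vectors = [poll_vector(p) for p in polls_sorted_by_date]
    return [np.stack(vectors[: i + 1]) for i in range(len(vectors))]
```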

With this as input, I have the neural network predict the ultimate margin of the election. I do not include any sense of "year" as input to the neural network, as I wish the model to extrapolate on this variable and hence do not want it to overfit to any trends there may be over time. I use three LSTM layers with 256 units followed by two fully connected layers with 512 neurons. The output layer is one neuron that predicts the final margin of the election. Mean squared error is used as the loss function. I initialize all weights randomly (using the keras defaults), but there might be a benefit to initializing via transfer learning from an exponentially weighted moving average.
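A minimal Keras sketch of this architecture (layer sizes and loss from the text; the optimizer, activations, and placement of the dropout layers are my assumptions):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

n_features = 10   # length of the poll-information vector
p_drop = 0.4      # dropout rate, calibrated below

model = Sequential([
    # Three LSTM layers with 256 units; None allows variable-length sequences.
    LSTM(256, return_sequences=True, input_shape=(None, n_features)),
    Dropout(p_drop),
    LSTM(256, return_sequences=True),
    Dropout(p_drop),
    LSTM(256),
    Dropout(p_drop),
    # Two fully connected layers with 512 neurons.
    Dense(512, activation='relu'),
    Dropout(p_drop),
    Dense(512, activation='relu'),
    Dropout(p_drop),
    # Single output: the predicted final margin of the election.
    Dense(1),
])
model.compile(loss='mse', optimizer='adam')
```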

I use dropout at time of prediction as a way to get an estimate of the error in the output of the model. The range where 90% of predictions lie using different RNG seeds for the dropout gives a confidence interval³. To calibrate the amount of dropout to apply after each layer, I trained the model on a training set (polls for elections before 2014) and tested different levels of dropout on the validation set (the 2014 election). I find the percentile of the true election result within the Monte Carlo model predictions. Thus, a perfectly calibrated model would have a uniform distribution of the percentile of true election results within the Monte Carlo model predictions. Of course, I do not expect the model to ever be perfectly calibrated, so I chose the dropout rate that minimized the KS-test statistic with the uniform distribution. This turned out to be 40%, which was comforting as this is a typical choice for dropout at training.
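A sketch of the Monte Carlo dropout and calibration step, assuming the Keras model above (the `K.learning_phase()` trick keeps dropout active at prediction time in Keras 2 with the TensorFlow backend; newer versions would pass `training=True` instead):

```python
import numpy as np
from keras import backend as K
from scipy.stats import kstest

# Run the model with the learning phase set to 1 so dropout stays on.
mc_predict = K.function([model.input, K.learning_phase()], [model.output])

def mc_samples(sequence, n_samples=1000):
    """Monte Carlo margin predictions for one poll sequence."""
    batch = sequence[np.newaxis, :, :]  # shape (1, timesteps, n_features)
    return np.array([mc_predict([batch, 1])[0].ravel()[0] for _ in range(n_samples)])

def calibration_ks(val_sequences, val_margins):
    """KS statistic of the true-result percentiles against a uniform distribution."""
    percentiles = [np.mean(mc_samples(seq) < margin)
                   for seq, margin in zip(val_sequences, val_margins)]
    return kstest(percentiles, 'uniform').statistic
```

The dropout rate is then chosen to minimize `calibration_ks` on the validation set, and the probability that a race swings to the first party is simply `np.mean(mc_samples(seq) > 0)`.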

A comparison of the calibration (blue) and ideal (green) CDFs for predictions on the test set. For the calibration curve, the optimal dropout of 40% is used.

I then retrain the model using all data prior to 2016 (thus using both the training and validation set). I take the calibrated dropout rate and again get Monte Carlo samples for polls, using the newly trained model, on the test set. I then simply count the number of positive margin Monte Carlo outputs to get a probability that the election swings in favor of the first party.

Applying this to a race, I find that the uncertainty in the ultimate election margin is pretty large, around 15%. This isn't any larger than the uncertainty of other aggregation methods, but it shows that it really is tough to accurately call close races based on polls alone.

Margin predicted (90% CI) by the model for the 2016 presidential election in Pennsylvania. The red line shows the actual margin.

Though this model hasn't learned the relationships between states, I tried applying it to the 2016 presidential election. To get the probability of a candidate winning based on the polls available that day, for each state I run 1000 predictions with different RNG seeds. For each of these 1000 predictions, I add up the electoral votes the candidate would win if they had the predicted margins. The probability of the candidate winning is then the fraction of these outcomes that reaches at least 270 electoral votes.
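A sketch of this electoral-college simulation, assuming per-state Monte Carlo margin samples like those above (the electoral-vote counts would come from a lookup table):

```python
import numpy as np

def candidate_win_probability(state_margin_samples, electoral_votes, n_sims=1000):
    """
    state_margin_samples: dict mapping state -> array of simulated margins,
                          where a positive margin means the candidate wins the state.
    electoral_votes:      dict mapping state -> number of electoral votes.
    """
    totals = np.zeros(n_sims)
    for state, samples in state_margin_samples.items():
        totals += electoral_votes[state] * (samples[:n_sims] > 0)
    # Fraction of simulated elections in which the candidate reaches 270 votes.
    return np.mean(totals >= 270)
```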

Histograms of possible presidential election outcomes predicted by the model each day before the election. The outcomes to the left of the red line are cases that result in a Republican victory (the ultimate outcome).

Ultimately, the model showed there was a 93% chance of Clinton winning the election on election day. This is already a more conservative estimate than what some news sources predicted.

The probability of Clinton winning the 2016 election predicted by the model as a function of days before the election.

Unless the 2016 election truly was a rare event, this clearly shows that the model is incomplete. Relationships between how states vote relative to their polling are crucial to capture. It would also be useful to include more polls in the training set to learn how to aggregate polls more effectively and, in particular, to better discern which pollsters are reliable. More features, such as whether the incumbent president is running or whether it is an off-year election, may also add more information to the predictions. I'll explore some of these ideas in a future blog post.

Code for this blog post is available here.

 

1. Divide by 365 days to normalize.
2. This is measured in thousands. I normalized using a tanh to get a number between 0 and 1. When not available, this defaults to 0.
3. Though I used Monte Carlo, which seems like a frequentist way to arrive at the confidence interval, it is actually a Bayesian method (see also this paper).
4. There is worry that the number of simulated samples is not sufficient to estimate the percentile when the actual margin is less than the smallest prediction value or larger than the largest prediction value. This happened less than 0.1% of the time for most of the choices of dropout rate and is not the main contributor to the KS test statistic, so it is ignored here.

 

Thoughts on Teaching

While I was getting my Ph.D., I taught almost every semester. I was a head TA for all but one semester, which meant that I managed a team of TAs and worked closely with the course instructors but didn't do any classroom teaching. I taught introductory electricity and magnetism the entire time, and as I watched different professors and different students go through the material, I noticed that our course really wasn't optimized to maximize how well students learn it. This was surprising to me, as I was at an institution that is consistently rated highly by university rating agencies.

I'll borrow a concept from Wait But Why about how surprising advancements are. For example, my parents used computers the size of a room that had the computational capacity of a device that can now be worn around the wrist. However, the way my parents were taught isn't too different from how teaching is done now. For the most part, lectures still consist of a lecturer standing at the front of the room writing on a blackboard. The lecturer may use a PowerPoint presentation, and the students may be distracted by their laptops and phones, but the fundamental way of teaching hasn't really changed.

There are, of course, efforts to modernize learning through websites like Coursera and Khan Academy. I think the real benefit of these is real-time feedback to the instructor (in this case, the algorithms behind the course). In the course I taught, the instructor only got individual feedback from the questions students asked (which not every student asks), as they never saw students' homework or test scores except in aggregate. However, taking courses through these online services is not recognized as sufficient mastery when compared to a university course, and I doubt that this will change anytime soon. So, I thought it would be interesting to consider what keeps universities from making efforts to teach more effectively.

Universities are not incentivized to teach well
When I started teaching, I thought that the metric I should optimize for was students' understanding of the course material. The problem is that this is not easy to measure. Most grading at the university level seems to be done on a curve (which, as I've written before, I have concerns about) to sidestep this issue. Even grading on a curve fairly is not trivial, as writing good exams is an art; I think only one instructor I worked with had truly mastered it. Poorly written, or even poorly administered, exams provide a weak signal of student understanding. And because instructors write different exams, it's even harder to combine signals from courses taught by different instructors into an aggregate view of how students learn.

This is an idea that is talked about in Cathy O'Neil's Weapons of Math Destruction. Because educational quality is ill-defined, rating agencies like US News use proxies that seem to correlate with educational quality to rank universities. This sets up a perverse incentive structure: the university is offering the service of teaching (though some may argue the service they offer is really the degree), but it optimizes around the metrics the rating agency chooses instead of actually making the learning experience great. Companies choose to hire from the perceived best schools, which means that if a high school student wants to set up for a great career, it's in their best interest to apply to these top universities even if it means they will not get the best education.

Looking at US News' methodology, it seems that only 20-30% of the metric that goes into ranking colleges comes directly from how teaching is conducted, in terms of class size and funds allocated to teaching students. An almost equal proportion (22.5%) of the score is reputation, which is hard for an outsider who has not taken courses at the university to assess. Part of this reputation could be the university's willingness to try techniques such as more active learning in classes, but this is handled in an ad hoc way, and I would like to see it be a more direct part of the metric.

Professors are not rewarded for teaching well
I will preface this section by saying that I worked with some amazing instructors while being a head TA. I don't want to belittle the great work that they do. The fact is, though, that there are instructors who are not so good who are still teaching students and that is the topic of the rest of this post.

I found that, as I worked to optimize my students' understanding of course material, not all of the professors shared the same goal. Some professors (but definitely not all) were optimizing for student happiness, under the constraint of having to rank students by performance. This makes sense from the instructor's point of view, as the metrics the department sees for an instructor are the grade distribution they assign and the feedback they receive from students (and maybe a peer review by other faculty members). I've always found it shocking that the department did not care what the course TAs thought of the professors, though admittedly this could be a noisy signal, as some courses have only one TA.

Some of the instructors who I felt were ineffective at getting their students to understand the material received the highest ratings from students. In fact, it's been shown that student surveys are another poor proxy for teaching effectiveness.

Tenure also makes it hard to get rid of a professor, and I found that the tenured professors were often some of the worst instructors in terms of student understanding. While being a Nobel-prize-eligible physicist is a great qualification for getting tenure, it is neither a necessary nor a sufficient condition for being a great lecturer¹. On the flip side, some of the best instructors I worked with were lecturers, yet they are hired on limited-term contracts and are paid little compared to their tenured colleagues.

Professors are averse to change
As a head TA, I worked with an instructor who was technologically illiterate and could not do much more on his computer than respond to email. There was another professor whose methods I had pedagogical concerns about. When I raised these with him, he responded with something along the lines of "no, we're not changing it, because this is the way I've done it for 20 years and the students like it this way." I worked with others who are so oversubscribed with research and family obligations that they can't be bothered to learn a new tool for the course they are teaching. This certainly isn't all the instructors I worked with, but the instructors described here are still teaching and will continue to teach unless broad changes are made.

Where I do see some willingness to adopt new tools is when work is being automated on the professor's side. In our course, we used Mastering Physics to automate the grading of homework assignments and reduce the grading burden on TAs. One concern I have with Mastering Physics is that it locks universities into multi-year contracts, creating friction against moving to a new system. Combined with instructors' reluctance to learn a new system, this causes Mastering Physics to stick around, and there is not much pressure on Mastering Physics to improve its services. I've found that this has left the course stuck with a bad system without much hope for improvement.

I've noticed some quick wins for Mastering Physics that were not implemented in the time I used the service. For example, I noticed that 75% of the wrong answers my students submitted had incorrect units. It would be relatively easy for Mastering Physics to implement a check for this and give the students more useful feedback so that they learn from their mistakes more easily.
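As a rough illustration of the kind of check I have in mind, here is a sketch using the pint library (this is purely hypothetical; it is not a feature Mastering Physics exposes):

```python
import pint

ureg = pint.UnitRegistry()

def has_expected_units(student_answer, expected_units="m/s**2"):
    """Return True if the student's answer has the expected dimensions."""
    try:
        answer = ureg.Quantity(student_answer)   # e.g. "9.8 m/s**2" or "9.8 kg"
    except pint.UndefinedUnitError:
        return False
    return answer.dimensionality == ureg.Quantity(expected_units).dimensionality

# has_expected_units("9.8 m/s**2")  -> True
# has_expected_units("9.8 kg")      -> False: flag "check your units" to the student
```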

Apparently, when Mastering Physics was first introduced into introductory physics courses, the professors found, using some standardized physics understanding tests, that understanding of physics actually increased after using Mastering Physics. The professors attributed this to the fact that the students got instant feedback on whether their answers were correct. This sounds great for Mastering Physics, but the problem is that the solutions to the problems are now easily accessible, so it is easy to get credit for the homework problems without actually solving them.

I have no clue how large a fraction of the students actually do the Mastering Physics problems. Sometimes we put problems very similar (or almost identical) to homework problems on the exams, and I'm quite surprised at how many students struggle with those problems on the exam. Even creating new problems is not a solution, as there are services like Chegg where others work out problems for the student. In theory, this was possible pre-internet, but the scale of the internet makes it that much easier for students to exploit.

Even beyond that, Mastering Physics and other products like WebAssign are limited in scope. They're rarely used outside of introductory classes because they can really only check the final answer. I do think there is more that can be automated in grading assignments. A lot of hard science and math questions boil down to choosing the correct starting point and showing a logical progression of steps to the final result. This is entirely within the realm of tasks a computer can do. Even for other disciplines where answers are short free responses, advancements in NLP are probably at the point where an answer could be compared with sample responses with quite high accuracy. We would still need a human to check that the model wasn't performing crazily, but this would greatly speed up the grading process.
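As a toy illustration of this kind of automated comparison, here is a sketch using TF-IDF cosine similarity; a real system would use a far more robust NLP model, and the threshold here is arbitrary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def matches_sample_response(student_answer, sample_responses, threshold=0.5):
    """Flag the answer for credit if it is close enough to any sample response."""
    n = len(sample_responses)
    tfidf = TfidfVectorizer().fit_transform(sample_responses + [student_answer])
    similarities = cosine_similarity(tfidf[n:], tfidf[:n]).ravel()
    # A human would still review answers near the threshold.
    return similarities.max() >= threshold
```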

The Future
Getting rating agencies, universities, and professors to care about teaching effectiveness is not an easy task. To me, the fundamental issue is that teaching effectiveness cannot be measured easily and poor proxies are being used to measure it. If rating agencies could rate universities based on teaching effectiveness, this would put pressure on universities to care more, which would then put pressure on instructors to care more.

I've mentioned before how analyzing Gradescope data could be useful in helping instructors get a better idea of what each individual student has actually learned. As mentioned earlier, this is quite valuable, as the lecturers for the course I taught never graded a homework assignment or a test. I could foresee instructors in the future being given a dashboard where they can see what individual students understand, while also being able to segment by factors such as attendance rate, to get near real-time feedback about how they are performing.

But I also think Gradescope may be able to tackle the teaching effectiveness problem. Because it offers faster, less-biased grading, instructors are likely to adopt the service; I've seen that instructors respond well when tedious parts of their work are automated. As adoption increases, Gradescope gets a unique data set, spanning different courses across many disciplines and many universities. Making sense of all the data is certainly a challenge, but if I had to guess, I would say that this data holds insights about teaching effectiveness that can hopefully drive some of the necessary changes to education.

 

1. This may not be true in every discipline. I have taken introductory anthropology classes in which the instructor's research was a relevant topic of discussion.

Predicting Elections from Pictures

This work was done by the amazing team I mentored during the CDIPS data science workshop.

I got the idea for this project after reading Subliminal by Leonard Mlodinow. That book cited research suggesting that when people are asked to rate pictures of candidates on competency, the average competency score of a candidate is predictive of whether the candidate will win. The predictions are correct about 70% of the time for senators and 60% of the time for house members, so while not a strong indicator, there seems to be some correlation between appearance and winning Senate races. We decided to use machine learning to build a model to assess senators' faces in the same manner.

This is a challenge, as there are only around 30 Senate races every two years, so there's not much data to learn from. We also didn't want to use data that was too old, since trends in hair and fashion change, and these probably affect how people perceive the competence of candidates. We ended up using elections from 2007-2014 as training data and predicting on the 2016 election. We got the senator images from Wikipedia and Google image search, collecting images for the top two finishers, usually a Democrat and a Republican. We didn't include other types of elections, since those images were less readily available and we weren't sure whether looking senatorial is the same as looking presidential (more on that later).

Interpreting Faces
We use a neural network to learn the relationships between pictures and likelihood of winning elections. As input, we provide a senator image with the label of whether the candidate won their race or not. Note that this means that our model predicts the likelihood of winning an election given that the candidate is one of the top two candidates in the election (which is usually apparent beforehand). The model outputs a winning probability for each candidate. To assess the winner of a particular election, we compare the probability of the two candidates and assume the candidate with the higher probability will win.

In order to cut down on training time, we used relatively shallow neural networks consisting of a few sets of convolutional layers followed by max-pooling layers. After the convolutional layers, we used a fully connected layer before outputting the election win probabilities. Even with these simple networks, there are millions of parameters that must be constrained, which will result in overfitting with the relatively limited number of training images. We apply transformations including rotations, translations, blur, and noise to the images in order to increase the number of training images to make the training more robust. We also explored transfer learning, where we train the model using a similar problem with more data, and use that as a base network to train the senator model on.
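A minimal Keras sketch of this kind of shallow network and augmentation (the filter counts, image size, and augmentation ranges here are my assumptions, not the exact ones we used):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.preprocessing.image import ImageDataGenerator

model = Sequential([
    # A few sets of convolutional layers followed by max-pooling layers.
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # A fully connected layer before the win-probability output.
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),   # probability the candidate wins
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Rotations, translations, and other transformations stretch the small training set.
augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)
# model.fit_generator(augmenter.flow(X_train, y_train, batch_size=32),
#                     steps_per_epoch=len(X_train) // 32, epochs=20)
```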

We use keras with the tensorflow backend for training. We performed most of our training on floydhub, which offers reasonably priced resources for deep learning training (though it can be a little bit of a headache to set up).

Model Results
Ultimately, we took three approaches to the problem that proved fruitful:
(I) Direct training on senator images (with the image modifications).
(II) Transfer on senator images from male/female classifier trained on faces in the wild.
(III) Transfer on face images from vgg face (this is a much deeper network than the first two).
We compare and contrast each of these approaches to the problem.

The accuracy in predicting the winners in each state in 2016 was (I) 82%, (II) 74%, and (III) 85%, respectively. Interestingly, Florida, Georgia, and Hawaii were races that all the models had difficulty predicting, even though these were all races where the incumbent won. These results make model (III) appear the best, but the number of Senate races in 2016 is small, so these results come with a lot of uncertainty. In addition, the training and validation sets are not independent. Many senators running in 2016 were incumbents who had run before, and incumbents usually win reelection, so if the model remembers these senators, it can do relatively well.


The candidates from Hawaii, Georgia, and Florida that all models struggled with. The models predicted the left candidate to beat the right candidate in each case.

Validating Results
We explored other ways to measure the robustness of our model. First, we had each model score different pictures of the same candidate.


Scores predicted by each of the models for different pictures of the same candidate. Each row is the prediction of each of the three models.

All of our models have some variability in predictions on pictures of the same candidate so our model may benefit from learning on more varied pictures of candidates. We have to be careful, though, as lesser-known candidates will have fewer pictures and this may bias the training. We also see that for model (III), the Wikipedia pictures actually have the highest score among all of the candidate images. Serious candidates, and in particular incumbents, are more likely to have professional photos and the model may be catching this trait.

We also looked at what features the model learned. First, we looked at how the scores changed when parts of the image were covered up. Intuitively, facial features should contribute most to the trustworthiness of a candidate.


Scores predicted by each of the models for pictures where part of the image is masked. Each row is the prediction of each of the three models.

We find that masking the images wildly changes the prediction for models (I) and (II) but not for model (III). It seems that for this model, apparel is more important than facial features, as covering up the tie in the image changed the score more than covering up the face.
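The masking analysis can be sketched as a simple occlusion scan over a trained model like the one above (the patch size, stride, and fill value here are arbitrary choices):

```python
import numpy as np

def occlusion_map(model, image, patch=32, stride=16):
    """Change in predicted score as a gray patch is slid over the image."""
    h, w, _ = image.shape
    base_score = model.predict(image[np.newaxis])[0, 0]
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            masked = image.copy()
            masked[y:y + patch, x:x + patch, :] = 0.5   # gray out one patch
            heatmap[i, j] = model.predict(masked[np.newaxis])[0, 0] - base_score
    return heatmap
```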

We also compared what the first convolutional layer in each of the models learned.


Samples of the first convolutional layer activations after passing in the image on the far left as visualized by the quiver package. Each row shows the output for each of the three models. We see that each model picks up on similar features in the image.

This confirms that apparel is quite important, with the candidate's tie and suit being picked up by each of the models. The models do also pick up on some of the edges in facial features as well. A close inspection of the layer output shows that the output of model (III) is cleaner than the other two models in picking up these features.
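The same activations can be read off without quiver by building a sub-model that stops at the first convolutional layer; a sketch assuming the Keras models above:

```python
import numpy as np
from keras.models import Model

def first_conv_activations(model, image):
    """Outputs of the first convolutional layer for a single image."""
    first_conv = next(layer for layer in model.layers if 'conv' in layer.name)
    activation_model = Model(inputs=model.input, outputs=first_conv.output)
    return activation_model.predict(image[np.newaxis])[0]   # shape (h, w, n_filters)
```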

Given all of these findings, we determine the most robust model is model (III), which was the model that predicted the most 2016 elections correctly as well.

Conclusion
Earlier, we mentioned we trained on senator data because we were not sure whether other elections had similar relationships between face and winning. We tested this hypothesis on the last three presidential elections. This is clearly an extremely limited data set, but we find the model predicts only one of the elections correctly. Since presidential elections are so rare, training a model to predict on presidents is a challenge.


Model (III) predictions on presidential candidates.

Our models were trained to give a general probability of winning an election. We ignore the fact that Senate elections, for the most part, are head-to-head. There may be benefits to training models that consider the two candidates running in an election and choose the winner between them. Ultimately, we would want to combine the features created here with other election metrics, including polls. Figuring out how to reliably combine results would be another large undertaking, but it may offer insights orthogonal to the methods that are currently used to predict election results.

Check out the code for the project here.

Thoughts on Grad School

For some background, there is no way I would have known in college that I would not want to apply for postdocs after my Ph.D. My career plan then was to go through grad school, be successful, and someday end up a professor. During my Master's degree, I realized that this might not be the life I wanted. My advisers at the time (who were married to each other) would routinely be at the institute until 8 or 9 in the evening and would come in early in the morning. I often wondered if they discussed much else than physics at home. During my Ph.D., I also observed my advisers working long hours and working on weekends and holidays. Once my adviser even told me that anyone I (a physicist) dated should expect me to be unavailable on weekends and holidays when there was important physics to do. Still, I was in too deep at this point, so I determined the minimal amount of work I needed to get a Ph.D. and completed that. Now that I have a job as a data scientist, the benefits of my Ph.D. seem to be the people I met and the connections I made (which did help me get the job), but the actual knowledge I gained during the degree has been mostly useless. Looking back, I'm reminded of some of the major issues with UC Berkeley and academia, and I have outlined them here.

Academia takes advantage of people
For some reason, during grad school, you are expected to volunteer your time, with no pay or credit. This is especially apparent during the summer: my contract said I was supposed to work 19 hours/week, but my adviser expected me to come in 40+ hours/week. What also shocked me is that when I told fellow grad students about this issue, they were not even aware they were only supposed to be working half time. Further, if an adviser does not have funding and the student teaches during the summer to cover costs, the student has no obligation to do research that summer, but this isn't communicated to the student. These facts are rarely spelled out. The sad thing is, this doesn't end with grad school. I know post-docs (at my institution and others) who have told me that their contracts say they should work around 40 hours/week, but they are routinely expected to work 50-60 hours/week.

My last semester in grad school, I was not enrolled in classes and not getting paid for research; I was just working on my thesis. Hence, there was no obligation for me to do anything that was not for my benefit. Still, my advisers tried to guilt me into coming into the office often (with threats of not signing my thesis) and into continuing to do work. It still bothers me that I paid the university (for full disclosure, I made plenty of money during a summer internship to cover these costs) for the "opportunity" to work for the research group that semester.

When research is done for course credit, the lines are blurred a bit. The time spent on the "course" is not necessarily fixed. I would think that if it is a "course," then research obligations related to the course should start when the semester starts and end when the semester ends. Certainly, the "course" should not require a student to attend a meeting on a university holiday or weekend (which happened to me during grad school). Most universities have standards and expect professors teaching courses to be available for their students. Some graduate students I know talk to their advisers a couple of times a semester, which hardly respects these standards. This has also led me to question what can and cannot be asked of a student in a course. For example, if there were a "course" in T-shirt making that made students work in sweatshop-like conditions to get a grade, would this be legal? While the product is less tangible, research for credit is a somewhat similar scenario, where students produce papers that will ultimately add to the professor's fame (with only a small chance of benefiting the student).

Another issue is that, as budgets get tighter, expenses get passed off to students. During my time at Berkeley, keeping pens and paper in a storeroom was deemed too expensive. The department suggested each research group make its own purchases of these items through the purchasing website. Not only is this a waste of graduate students' time, but the website was so terrible that it was often easier just to buy items and not get reimbursed for them. Most research is done on students' personal laptops, and while these are a necessity to continue working, there is rarely support from the university to make the purchase or to pay for maintenance when the laptop is used for research. There is no IT staff, so again it falls on students and post-docs to waste their time dealing with network issues and computer outages rather than focusing on the work that is actually interesting.

Academia tries to ignore that most of its graduate students will not go into academia
I apparently have a roughly 50% chance of "making it" as a professor, which is mostly because I attended a good institution and published in a journal with a high impact factor. Ph.D. exit surveys have found similar rates for the fraction of students that stay in academia. Yet, the general expectation in academia is that all of the students will go on to do a research-focused career (I know some professors who look down upon a teaching-focused career as well even though this is still technically academia).

Even after making it extremely clear to my adviser that I had no intentions of pursuing a postdoc, he told me that I should think about applying. He went so far as to say data science (my chosen profession) was a fad and would probably die out in a few years. Once during his class, he seemed quite proud of the fact that between industry and academic jobs, most of his students had wound up staying in physics. While my adviser wasn't too unhappy about me taking courses unrelated to research, many advisers will strongly encourage their students to focus on research and not take classes. This is terrible advice, considering useful skills in computer science and math are often crucial to getting jobs outside of academia.

The physics curriculum, in general, is flawed. At no point in a typical undergraduate/graduate curriculum are there courses on asymptotic analysis, algorithms, numerical methods, or rigorous statistics, all of which are useful both inside and outside of physics. Often these are assumed known or trivial, yet this gives physicists a poor foundation and can lead to problems when working on relevant problems. I took classes on all of these topics, though they were optional, and they are proving to be more useful to me than most of the physics courses I took or even the research I conducted as a graduate student.

This is a problem at the institutional level as well. For physics graduate students at UC Berkeley, the qualifying exam is set up to test students on topics of the student's choosing. I chose numerical methods and statistics as my main topic, but my committee asked me no questions on numerical methods and statistics (to be fair, one member tried, but he didn't know what to ask; then again, he could have prepared something, since the topics are announced months before the exam). Instead, my committee asked me general plasma physics and quantum mechanics questions, which were not topics I had chosen. I failed to answer those questions (and am even less able to answer them now), but somehow still passed the exam. This convinced me that the exam was nothing more than a formality, yet no one was willing to make changes to make it more useful. An easy change would be to frame the oral exam as interview practice, as most graduate students have no experience with interviews.

There are many student-led efforts to try to make transitioning into a non-academic job easier at UC Berkeley. But the issue is, for the most part, they are student-led and have little faculty support. I was very involved in these groups and it was quite clear that my adviser did not want me to be involved. Considering that involvement is why I have a job right now, I would say I made the right choice.

Your adviser has a lot of power over you
As I alluded to before, even though I was receiving no money or course credit from my research group in my last semester, my advisers threatened not to sign my thesis unless I completed various research tasks (some unrelated to the actual thesis). I've talked to others who have had similar experiences, and I hate to say I'm confident that this extends beyond research tasks in some research groups. This could be solved by making the thesis review process anonymous and having advisers take no part in it, but there seem to be no efforts to make this happen.

Another fault is that advisers can get rid of students on a whim. I've known people who have sunk three years into a research group only to be told that they cannot continue. The student then has to decide whether to start from zero and spend a ridiculous amount of their life in grad school, or to leave without a degree, making the three years in the research group irrelevant. Because of this power, professors can make their students work long hours and come in on weekends and holidays. If the student does not oblige, the time the student has already put into the research group is just wasted time.

It doesn't help that research advisers are usually also respected members of their research area. That is, if a student decides to stay in academia (particularly in the same research area as grad school), the word of their adviser could make or break their career. This again gives an opportunity for advisers to request favors from students.

Further, the university has every motive to protect professors, especially those with tenure, but not graduate students. This became embarrassingly clear, for example, in how the university handled Geoff Marcy before Buzzfeed got a hold of the news. If professors can ignore rules set by the university (and sometimes laws) with little repercussion, graduate students can have little faith that any new rules the university creates to address the problems mentioned here will be followed. I don't want to belittle the (mostly student-led) efforts to make sexual harassment less of a problem at UC Berkeley, but ultimately the only change I can pinpoint is that there is now more sexual harassment training, which research at UC Berkeley has shown can lead to more sexual harassment incidents.

Ultimately, graduate school and a career in academia certainly work for some people. Some people are passionate about science and love working on their problems, even if that means making a few sacrifices. The progression of science is a noble, necessary goal, and I am glad there are people out there making it happen. My hope is that many of the problems mentioned here can be rectified, so that the experience improves for those people and so that the people who realize academia isn't for them can succeed on a different path.

2017 Reading List

I've been reading a lot during my commutes this year, and I thought I'd summarize some of my thoughts about the books and blogs I've read and how much I enjoyed them.

Subliminal: How Your Unconscious Mind Rules Your Behavior by Leonard Mlodinow - 4/5
This was a fun read. Lots of examples of how unconscious decisions are actually more prevalent than most people realize. This was particularly interesting as I've been thinking more about unconscious bias lately. This book got me thinking about data-driven approaches to quantify bias (both conscious and unconscious), but it is obviously tricky to define the correct loss function to train this model.

But What If We're Wrong? by Chuck Klosterman - 2/5
Not sure if I got the point of this book (to be fair, the book did warn me that this might happen at the beginning). I thought the book was about politics (the back of the book mentions the president), but what I remember of it was mostly about pop culture and philosophy. Still, it's good to be reminded periodically to try to think how others feel in a two-sided political situation.

Dune by Frank Herbert - 1/5
Couldn't really get into this book. I thought the world was not particularly interesting and apart from a few events early on, I found the story a bit slow.

Data for the People by Andreas Weigend - 5/5
Full disclosure, Andreas is a friend of mine, so it's hard for me to be objective about this book. I think Andreas discusses interesting ideas on the trade that users make with a company: giving up some of their privacy in exchange for better services. I worry, though, that the ideas are hard to implement without an external body to enforce them. There are many fun stories about data to tie all of these ideas together.

Hillbilly Elegy by J.D. Vance - 3/5
As a successful lawyer who grew up relatively poor in the rust belt, the author is an ideal person to translate the ideals of the working class for city dwellers. This book is mostly about the life of the author, but it touches on religion, family values, and the hillbilly lifestyle that contributed to his worldview. I felt like it definitely helps contextualize some of the voting patterns that I see in the U.S.

Weapons of Math Destruction by Cathy O'Neil - 4/5
Definitely some good lessons on how defining metrics is key and how, without retraining, models can get gamed and fail to serve their intended purpose. I feel like there were a lot of ideas and guidelines for preventing dangerous models, but without an enforcement mechanism, I'm unsure whether any of them can come to fruition. I'm always advocating for data-driven solutions to problems, but this book has made me consider that models have to be designed cautiously, especially when biases can be involved.

Wait but Why Blog by Tim Urban - 4/5
This blog offers in-depth analyses of tech ideas like AI and Elon Musk's companies. The author goes heavily into the details about each topic to provide an in-depth view of the topic. The only downside is that sometimes I feel like all the criticisms of the topic are not adequately addressed.

Who will win Top Chef Season 14?

Warning: Spoilers ahead if you have not seen the first two episodes of the new season

In the first episode of the season Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has won the first quickfire and gone on to win the whole thing (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9 when a whole team won the elimination challenge). This got me wondering if there were other predictors as to who would win Top Chef.

There's not too much data after the first elimination challenge, but I tried building a predictive model using the chefs' gender, age, quickfire and elimination performance, and current residence (though I ultimately selected the most predictive features from this list). I used these features, with elimination number as the target variable, to build a gradient-boosted decision tree model predicting when the chefs this season would be eliminated. I validated the model on seasons 12 and 13 and then applied it to season 14, using the total distance between the predicted and actual placings of the contestants as the metric to optimize during validation. The model predicted the winners of both validation seasons correctly, but seasons 12 and 13 were two seasons where the winner of the first elimination challenge became Top Chef.
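A sketch of this setup using scikit-learn's gradient-boosted trees (the column names are hypothetical placeholders for the features described above):

```python
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ['elim1_performance', 'season', 'gender', 'home_state_advantage',
            'from_boston', 'from_california', 'from_chicago']

def train_and_rank(train_df, new_season_df):
    """Fit on past seasons and rank the new season's chefs by predicted staying power."""
    model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
    model.fit(train_df[FEATURES], train_df['elimination_number'])
    ranked = new_season_df.copy()
    ranked['predicted_elimination'] = model.predict(new_season_df[FEATURES])
    # A higher predicted elimination number means the chef is predicted to last longer.
    return ranked.sort_values('predicted_elimination', ascending=False)
```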

The most important features in predicting the winner were: elimination challenge 1 performance, season (capturing general trends across seasons), gender, home-state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better, as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Boston. To try to visualize this better, I used these important features and performed a PCA to plot the data in two dimensions. This shows how the data cluster, without any knowledge of the ultimate placement of the contestants.
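The figure below can be reproduced along these lines (reusing the hypothetical FEATURES columns from the sketch above):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_pca(df):
    X = StandardScaler().fit_transform(df[FEATURES])
    components = PCA(n_components=2).fit_transform(X)
    # Color by eventual placement: blue = lasted longer, red = eliminated earlier.
    plt.scatter(components[:, 0], components[:, 1],
                c=df['elimination_number'], cmap='coolwarm_r')
    plt.xlabel('PC1 (mostly first-elimination success)')
    plt.ylabel('PC2 (mostly gender)')
    plt.show()
```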


A plot of the PCA components using the key identified features. The colors represent the ultimate position of the contestants. Blue represents more successful contestants where red represents less successful contestants. The x direction corresponds mostly to first elimination success (with more successful contestants on the right) and the y direction corresponds mostly to gender (with male on top). The smaller spreads correspond to the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is a certain deal-breaker in terms of winning the competition, but of course, winning the first challenge puts you in a better position.

My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.

Of course, there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about including but didn't was what the chefs actually cooked. I thought certain ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g. season 11 was probably the only time turtle was cooked in an elimination challenge). Obviously, with more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.

The code is available here.

Election Thoughts

This election was anomalous in many ways. The approval ratings of both candidates were historically low. Perhaps related, third-party candidates were garnering much more support than usual. The nationwide polling of Gary Johnson was close to 5% and Evan McMullin was polling close to 30% in Utah. There's never really been a candidate without a political history who has gotten the presidential nomination of a major party and there's never been a female candidate who has gotten the presidential nomination of a major party.

These anomalies certainly make statistical predictions more difficult. We'd expect that a candidate might perform similarly to past candidates with similar approval, similar ideologies, or similar polling trends, but there were no similar candidates. We have to assume that the trends that carried over in past, very different elections apply to this one, and presumably, this is why so many of the election prediction models were misguided.

I have a few thoughts I wanted to write out. I am in the process of collecting more data to do a more complete analysis.

Did Gary Johnson ruin the election?
No. In fact, the evidence points to Johnson helping Clinton, not hurting her. Looking at the predictions and results in many of the key states (e.g. Pennsylvania, Michigan, Florida, New Hampshire; Wisconsin was a notable exception), Clinton underperformed slightly compared to expectations, but the far greater effect was that Trump overperformed and Johnson underperformed compared to expectations. This is a pretty good indicator that those who said they'd vote for Johnson ultimately ended up voting for Trump. There seems to be some notion that people were embarrassed to admit they'd vote for Trump in polls. This might be true (but also see this), but the fact that third-party candidates underperform relative to polling is a known effect. However, the magnitude was certainly hard to predict, because a third-party candidate has not polled so well in recent elections. It doesn't really make sense to assume Johnson voters would have voted for Clinton either. When Johnson ran for governor of New Mexico as a Republican, he was the Libertarian outsider, much like Trump was an outsider getting the Republican nomination for president. Certainly, Johnson's views are closer to the conservative agenda than the liberal one.

Turnout affected by election predictions
There has been reporting on how Clinton got the third-highest vote total ever of any presidential candidate (after Obama in 2008 and 2012). This is a weird metric to judge her on, considering turnout decreased compared to 2012 and Clinton got a much smaller percentage of the vote than Obama did that year. Ultimately, the statement just says that the voting pool has increased; it is not a deep statement about how successful Clinton was. In particular, let's focus on the 48% of the vote that Clinton got. I have to imagine that if a candidate has low approval and there are claims she has a 98% chance of winning the election, a lot of people just aren't going to be excited to go vote for her. I could see this manifesting as low turnout and increased third-party support. Stein did do about three times better than she did in 2012 (as did Johnson). In addition, people who really dislike the candidate (and there are a lot of them, since the candidate has a low approval rating) are motivated to show up for the election. I don't see obvious evidence of this, but I have to imagine there was an incentive to go vote against Clinton. This could explain the slight underperformance relative to polls in the aforementioned states as well as the large Clinton underperformance in Wisconsin. There's been talk of fake news affecting the election results, but I think the real news predicting the near-certain election of Clinton had just as much to do with it.

Would Clinton have won if the election were decided by popular vote?
This is a very difficult question to answer. The presidential candidates campaign assuming the electoral college system so clearly the election would be different if it were decided by popular vote. Certainly, this seems quite efficient for Democrats. Democratic candidates can campaign in large cities and encourage turnout there, whereas Republican candidates would have to spread themselves thinner to reach their voter base. One thing I haven't seen discussed very much is that this would probably decrease the number of third-party voters. In a winner-takes-all electoral college system, any vote that gives the leading candidate a larger lead is wasted. So, in states like California, where Clinton was projected to have a 23 point advantage over Trump, a rational voter should feel free to vote for a third party since this has no effect on the outcome. In a national popular vote, there are no wasted votes and a rational voter should vote for the candidate that they would actually like to see be president (of course people don't always act rationally). As argued before, Johnson's voters seem to generally prefer Trump over Clinton, so the number of these people that would change their vote under a popular vote election is definitely a relevant factor in deciding whether a national popular vote election would actually have preferred Clinton. Stein's voters would generally prefer Clinton over Trump, but there were fewer of these voters to affect the results.

Electoral college reform needs to happen
Yes, but if it didn't happen after the 2000 election, I think it's unlikely to happen now. The most feasible proposal I have been able to come up with (with the disclaimer that I have very little political know-how and am strictly thinking of this as a mathematical problem) is to increase the number of House members. This requires only a change to federal law, and thus would not be as hard to change as the whole electoral college system, which would take a constitutional amendment. If states appointed electors proportionally, then as the number of House members increases, the electoral college system approaches a national popular vote. This is complicated by the winner-takes-all elector system most states have. For example, the total population of the states (and district) Clinton won appears to be 43.7% of the total U.S. population, so even though she won the popular vote, with winner-takes-all systems in place, it is difficult to imagine a simple change to the electoral college that gets closer to a popular vote.
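A toy simulation of this idea, with made-up state populations and vote shares, shows that proportional allocation converges to the popular vote as the House grows:

```python
import numpy as np

rng = np.random.RandomState(0)
n_states = 51
population = rng.lognormal(mean=1.0, sigma=1.0, size=n_states)   # made-up state populations
vote_share = rng.uniform(0.35, 0.65, size=n_states)              # candidate A's share per state

def electoral_vote_share(n_house_seats):
    """Candidate A's share of electors under proportional allocation."""
    house_seats = np.maximum(1, np.round(n_house_seats * population / population.sum()))
    electors = house_seats + 2                    # two Senate-based electors per state
    return (electors * vote_share).sum() / electors.sum()

popular_vote_share = (population * vote_share).sum() / population.sum()
for seats in [435, 1000, 10000, 100000]:
    print(seats, round(electoral_vote_share(seats), 4),
          'popular vote:', round(popular_vote_share, 4))
```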

Grade Inflation

I have been thinking a lot about teaching lately (maybe now that I will not be teaching anymore) and I hope to write a series of a few blog posts about it. My first post here will be on grade inflation, specifically whether curving is an effective way to combat it.

A popular method to combat grade inflation seems to be to impose a set curve for all classes. That is, for example, the top 25% of students get As, the next 35% get Bs, and the bottom 40% get Cs, Ds, and Fs (which is the guideline for my class). While this necessarily avoids the problem of too many people getting As, it can be a bit too rigid, as I will show below.

In the class I teach, there are ~350 students, who are spread among three lectures. I will investigate what effect the splitting of students into the lectures has on their grades. First, I will make an incredibly simple model where I assume there is a "true" ranking of all the students. That is, if all the students were actually in one big class, this would be the ordering of their grades in the course. I will assume that the assessments given to the students in the classes they end up in are completely fair: if their "true" ranking is the highest of anyone in the class, they will get the highest grade in the class; if their "true" ranking is the second highest, they will get the second-highest grade; and so on. I then assign students randomly to three classes and see how much their percentile in the class fluctuates based on these random choices. This is shown below.


The straight black line shows the percentile a student would have received had all the students been in one large lecture. The black curves above and below it show the 90% variability in percentile due to random assignment.
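A minimal sketch of the simulation behind this figure (the class assignment and percentile bookkeeping follow the description above; reading the 90% variability as the 5th/95th percentile band is my assumption):

```python
import numpy as np

n_students, n_classes, n_trials = 350, 3, 1000
rng = np.random.RandomState(0)
percentiles = np.zeros((n_trials, n_students))   # student index = "true" rank, 0 is best

for t in range(n_trials):
    assignment = rng.randint(0, n_classes, size=n_students)   # random class for each student
    for c in range(n_classes):
        members = np.where(assignment == c)[0]   # already ordered by true rank
        n = len(members)
        # Assessments are assumed perfectly fair: best true rank -> highest class percentile.
        percentiles[t, members] = 1.0 - np.arange(n) / max(n - 1, 1)

# 90% band for each student across the random assignments (the two black curves).
lower, upper = np.percentile(percentiles, [5, 95], axis=0)
```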

We see that even random assignment can cause significant fluctuations, and creates variability particularly for the students in the "middle of the pack." Most students apart from those at the top and bottom could have their letter grade change by a third of a letter grade just due to how the classes were chosen.

Further, this was just assuming the assignment was random. Often, the 8 am lecture has more freshmen, because they register last and lectures at better times are likely to fill up. There may also be a class that advanced students would sign up for that conflicts with one of the lecture times of my course, which would cause these advanced students to prefer the lectures at other times. These effects would only make the story worse than what's shown above.

We have also assumed that each class can perfectly rank the students from best to worst. Unfortunately, there is variability in how exam problems are graded and in how well questions distinguish students, so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything or not. Since the grading is based on ranking students, even if a lecturer is ineffective and the students in the course don't learn very much, the students' scores will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but it seems like any top university should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). A particular issue here is that how much of the course material students learned is extremely hard to quantify without imposing standards. However, standardized testing leads to ineffective teaching methods (and teaching "to the test") and is clearly not the answer. I'm not aware of a good way to solve this problem, but I think taking data-driven approaches to study it would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages on each class, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, the upper and lower curves would ideally be much closer to the grade in the "true ranking" scenario, so that luck and scheduling conflicts have much less of an effect on a student's grade. This would mean that one lecture might sometimes get 40% As and sometimes 15% As, but that would be fine, because those are the grades the students should get. The question then becomes how to accomplish this.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean writing three different test problems on each topic and randomly assigning each student one of the three. Apart from the logistical nightmare this would bring about, it would really only work when one lecturer teaches all the classes. If one lecturer is much better than another, or likes to work problems close to the test problems in lecture, then their students will perform better relative to students in the other lectures because of the lecturer rather than their own ability. To make this work, there needs to be a way to "factor out" the effect of the lecturer.

Another method would be to treat grading more like high school and set rigid grade cutoffs. The tests would then have to be written so that we would expect their outcomes to follow the guideline grade distribution set by the university, assuming the students in the class are representative of the general student population. Notably, the test is not written so that this particular offering of the course follows the guideline distribution. Of course, this is more work than simply writing a test, and the outcome of a test is certainly hard to predict. I have often given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing that the instructor worked an extremely similar problem in class.

One way to implement this would be to look at similar problems on past tests and see how students did on them. (Coincidentally, this wasn't possible until recently, when we started using Gradescope.) This gives an idea of how we would expect students to perform, and we can use that data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they will have seen while practicing past exams, so it is hard to define how similar a problem can be. Doing this right requires quite a bit of past test data, which, as I mentioned, isn't yet available. Similar problems given by other professors may help, but then we run into the same issue as above: different lecturers teach, and therefore standardize, differently.

Without experimenting with different solutions, it's impossible to know which is best, but it seems crazy to accept that curving individual classes is the best we can do. With some work, there could be solutions that encourage student collaboration, reward students for doing their homework (I hope to write more on this in the future) rather than penalizing them for not doing it, and take into account how much students are actually learning.

Code for the figure is available here.

Topic Modeling and Gradescope

In this post, I'll be looking at trends in the exam responses of physics students, using Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested in whether the things students get right correlate in the way physics intuition would suggest. For example, I might expect students who can apply Gauss's law correctly to also be able to apply Ampere's law correctly, as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique often applied to topic modeling of bodies of text (as in the last blog post). The idea of NMF is to take a matrix with nonnegative entries, A, and find matrices W and H, also with nonnegative entries, such that

 A = WH.

Usually, W and H are chosen to be low-rank matrices, so the equality above is only approximate. A vector in A is then expressed as a nonnegative linear combination of a small number of topic vectors. This is natural for topic modeling because everything is nonnegative, meaning cancellations between topics cannot occur.
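As a minimal illustration (not the code used for this post), here is what the factorization looks like with scikit-learn's NMF on a random nonnegative matrix. Note that, depending on how the matrix is oriented, scikit-learn's W/H naming may be swapped relative to the naming used here:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((100, 40))        # e.g. 100 students x 40 rubric items, all entries nonnegative

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(A)       # (100, 5): each student's weight on each topic
H = model.components_            # (5, 40): each topic's loading on each rubric item

print(np.linalg.norm(A - W @ H))  # how good the low-rank approximation is
```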

The data

For each student and each rubric item, Gradescope stores whether the grader selected that item for that student. Each rubric item has a point value associated with it, so I use these values as the weights in the matrix on which I perform the NMF. The problem, though, is that some rubric items correspond to points being taken off, which is not a nonnegative quantity. For those items, I gave students who avoided the penalty an entry equal to the magnitude of the penalty, and students who were penalized a 0 entry in that position of the matrix.

There were also 0-point rubric items (we use these mostly as comments that apply to many students). I ignored these entries, though finding a way to incorporate this information could also be interesting.
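A sketch of this encoding on synthetic data (the item values and selections below are random placeholders, not real Gradescope output):

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 200, 30

# Point value of each rubric item: negative values are penalties, zeros are comment-only items.
points = rng.choice([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0], size=n_items)
# Whether the grader selected each item for each student (1 = selected).
applied = rng.integers(0, 2, size=(n_students, n_items))

A = applied * points
# Penalty items: students who avoided the penalty get +|penalty|, penalized students get 0,
# so every entry of A stays nonnegative.
penalty = points < 0
A[:, penalty] = (1 - applied[:, penalty]) * (-points[penalty])
# Drop the 0-point comment items entirely.
A = A[:, points != 0]
```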

Once the matrix is constructed, I run NMF on it to get the topic matrix W and the composition matrix H. I then look at the entries in W with the highest values; these identify the key ideas in each topic.

Results

The choice of the number of topics (the rank of W and H above) was not obvious. Ideally, it would be a small number (like 5) so it would be easy to just read off the main topics. However, this seemed to pair together some unrelated ideas simply because they were difficult (presumably because the better students did well on all of them). Another idea was to look at the reconstruction error \|A - WH\|_2 and determine where it flattened out. As is apparent below, this analysis suggested that adding more topics beyond 20 did not help reduce the error in the factorization.
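A sketch of that scan, again on placeholder data, using the reconstruction_err_ attribute exposed by scikit-learn's NMF:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((200, 30))   # stand-in for the student x rubric-item matrix

errors = {}
for k in range(2, 26):
    model = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
    model.fit(A)
    errors[k] = model.reconstruction_err_   # Frobenius norm of A - WH

for k, err in errors.items():
    print(k, round(err, 3))   # look for where the curve flattens out
```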


With 20 topics, it was a pain to look through all of them to determine what each one represented, and some topics were almost identical. One example involved a problem on finding the work done in an adiabatic process: using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics, but one of the two also required computing the work correctly while the other did not. This is probably an indication that the algebra leading up to the work was difficult for some students. Balancing these issues, I ultimately chose 11 topics, which seemed to work reasonably well.

Some "topics" are topics simply by virtue of being worth many points. This would be rubric items with entries such as "completely correct" or "completely incorrect." This tends to hide the finer details that in a problem (e.g. a question testing multiple topics, which is quite common in tests we gave). These topics often had a disproportionate number of points attributed to them. Apart from this, most topics seemed to have roughly the same number of points attributed to them.

Another unexpected feature was a topic that negatively correlated with a student's score. This was counter-intuitive: in NMF each topic can only contribute positively, so having a significant component in any topic should mean a higher score. The reason this component exists is that it captures rubric items that almost everyone gets right. Higher-scoring students pick up the points for those rubric items through other topics that also contain them. Most of the other topics had their largest contributions from rubric items that fewer than 75% of students obtained.

Many topics were contained within a problem, but related concepts across problems did cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. However, it was more common for topics to be entirely contained in a problem.

The exam I analyzed was interesting as we gave the same exam to two groups of students, but had different graders grade the exams (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, but many topics seemed to cluster rubric items slightly differently. Still, the general topics seemed to be quite consistent between the two exams.


  

The plots show students' aptitude in a topic as a function of their total exam score, for three different topics. Clearly, the behavior can look quite different depending on the topic.

Looking at topics by students' overall scores, as shown above, reveals some interesting trends. As mentioned before, there are one or two topics that even low-scoring students "master," but these are just the topics for which nearly all students get points. A little over half of the topics are ones that students who do well excel at but that a significant fraction of lower-scoring students struggle with. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may be indicative of misconceptions students have in approaching these problems; my guess is that the struggling students did not evaluate an integral to determine the entropy change but tried to determine it some other way.

The rest of the topics (two to four of them) were ones where the distribution of points was relatively unrelated to the total score on the exam. In the example shown above, the topic was calculating the work (and getting its signs right) in isothermal processes, which is a somewhat involved calculation. This seems to indicate that success in this topic is unrelated to understanding of the overall material. It is hard to know exactly why, but my guess is that these topics test students' ability to do algebra more than their understanding of the physics.

I made an attempt to assign a name to each of the topics found in the midterm (ignoring the topic that negatively correlated with score). The result was the following: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every rubric item in each topic fit nicely under these names. In particular, the rubric items that applied to many students (>90%) would often not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The spider charts below show several students' abilities in each of these concepts, where I assumed each topic in a concept contributes equally to that concept.
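A sketch of that aggregation (the topic-to-concept grouping and the aptitude values below are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_topics = 200, 10
# Per-student topic aptitudes; in practice these come from the NMF composition matrix.
aptitude = rng.random((n_students, n_topics))

# Hypothetical grouping of topic indices into the four major concepts.
concepts = {
    "thermodynamic processes": [0, 3, 4],
    "particle kinetics and equations of state": [1, 6],
    "entropy": [2, 7, 9],
    "conservation laws": [5, 8],
}

# Each topic in a concept contributes equally to that concept's score.
concept_scores = {name: aptitude[:, cols].mean(axis=1) for name, cols in concepts.items()}
```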


  

Aptitude in the four main concepts for an excellent student (left), an average student (middle), and a below-average student (right).

Conclusions

Since the underlying data has both positive and negative structure (points can be given or taken off), there may be other matrix decompositions that handle it better. In principle, the same analysis could also be done using not the matrix of points but the matrix of boolean (1/0) indicators of rubric items. This would also allow the zero-point rubric items, which were ignored in this analysis, to be taken into account. I do not know how this would change the observed results.

I had to manually look through the descriptions of the rubric items that applied to each topic and decide what the topic represented. An exciting (though challenging) prospect would be to automate this process. It is tricky, though: the algorithm would have to learn associations such as the symbol S referring to entropy. There may also be insights to gain from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.

Natural Language Processing and Twitch Chat

This post will be about using natural language processing (NLP) to extract information from Twitch chat. I found the chat log of a popular streamer, and I'll be analyzing one day of chats. I'll keep the streamer anonymous just because I didn't ask his permission to analyze his chat.

On this particular day, the streamer had 88807 messages in his chat, with 11312 distinct users chatting. This works out to an average of about 7.9 chats per user. However, this doesn't mean that most people actually post that much: in fact, 4579 users (about 40%) posted only one message. This doesn't take into account the people who never posted, but it shows that it is quite common for users to "lurk," or watch the stream without actively taking part in chat. The distribution of posts per user is shown below:



A histogram of the number of chat messages per user (note the log scale). Almost everyone in chat posts fewer than 10 messages. Chat bots were not included, so everyone represented in the plot should be an actual user.

Only 1677 users (about 15%) posted 10 or more messages in chat, but they accounted for 65284 messages (about 73.5%). This suggests some form of Pareto principle at work.
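A minimal sketch of these summary statistics, assuming the chat log has been parsed into a pandas DataFrame with (hypothetical) user and message columns:

```python
import pandas as pd

# Placeholder for the parsed chat log; in practice this would have ~89k rows.
chat = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b", "a"],
    "message": ["Kappa", "lol", "wtf game", "hi", "Kappa", "PogChamp"],
})

counts = chat.groupby("user").size()
print("average messages per user:", counts.mean())
print("fraction of users with one message:", (counts == 1).mean())
heavy = counts[counts >= 10].index                       # users with 10+ messages
print("share of all messages from heavy chatters:", chat["user"].isin(heavy).mean())
```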

What are people talking about?

I used tf-idf on the chat log to get a sense of common words and phrases. The tf in tf-idf stands for term frequency and is, for each chat message, the number of times a given word appears in that message [1]. idf stands for inverse document frequency and is, for each term, the negative log of the fraction of all messages the term appears in; it indicates how much information a word carries, since common words like "the" and "a" don't carry much. tf-idf multiplies the two into one score for each term in each chat message, and the words with the highest scores are the most characteristic words in chat. The following table shows some of the common words in chat.

 

All Chatters      >10 Chat Messages      One Chat Message
lol               lol                    game
kappa             kappa                  lol
kreygasm          kreygasm               kappa
pogchamp          pogchamp               kreygasm
game              game                   wtf
myd               kkona                  stream
kkona             dansgame               followage

The words with the highest tf-idf scores for all messages, for users who post many messages, and for users who post only one message.
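A sketch of how such a ranking could be produced with scikit-learn's TfidfVectorizer, summing each word's tf-idf weight over all messages in a group (the messages below are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words(messages, n=7):
    """Return the n words with the highest total tf-idf weight over the messages."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(messages)         # sparse (n_messages, n_terms)
    scores = np.asarray(tfidf.sum(axis=0)).ravel()     # total weight of each term
    terms = np.array(vectorizer.get_feature_names_out())
    return terms[np.argsort(scores)[::-1][:n]]

# In practice, the three columns of the table come from three different message groups.
print(top_words(["Kappa lol", "wtf game", "Kappa Kreygasm", "game lol Kappa"]))
```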

Not surprisingly, the people who chat a lot have a word distribution similar to that of all messages (remember, they account for about 73.5% of them). Those who posted only one message are talking about slightly different things than those who chat a lot. There are a few interesting features, which I elaborate on below.

myd and followage are bot commands on the streamer's stream. Apparently gimmicks like this are fairly popular, which means many people are chatting without adding content to the stream. It is interesting that those who post more are far less likely to play with these bot commands.

On this day the streamer was playing random games that his subscribers had suggested. This led to weird games and thus many people commented on the game, hence the prevalence of words like "game" and "wtf". People who only post one message seem more likely to complain about the game than those who talk often. For words like this, it could be interesting to see how their prevalence shifts when different games are played.

For those not familiar with Twitch, kappa, kreygasm, pogchamp, kkona, and dansgame are all emotes. Clearly, the use of emotes is quite popular. kkona is an emote on BTTV (a Twitch extension), so it is quite interesting how many people have adopted its use, and this may also indicate why it is more popular with people who post more.

Who do people talk about?

I wanted to see what kinds of "conversations" take place in Twitch chat, so I selected messages that reference other users and again looked at the most common words under tf-idf. Unfortunately, this method misses many references (e.g., if there were a user named Nick482392, other people might simply refer to him as Nick), but for an exploratory analysis it seemed sufficient.
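A crude sketch of this selection, matching message tokens against the set of usernames seen in the log (the DataFrame has the same hypothetical structure as above):

```python
import pandas as pd

chat = pd.DataFrame({
    "user": ["a", "Nick482392", "b"],
    "message": ["Nick482392 what game is this", "lol", "banned Kappa"],
})

usernames = set(chat["user"].str.lower())

def mentions_user(message):
    # Crude word-level check: does any token exactly match a known username?
    return any(token.lower() in usernames for token in message.split())

references = chat[chat["message"].apply(mentions_user)]
# `references` can then be fed back into the same tf-idf ranking as before.
```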

The most referenced person was, predictably, the streamer himself, with 1232 messages mentioning him. The top words for the streamer included "play" with countless suggestions for what other games the streamer should play. During this day, apparently another prominent streamer was talking about the streamer I analyzed, and many people commented on this. There were also many links directed at the streamer. There were no particularly negative words in the most common words directed at the streamer.

I also considered references to other users. There were 4697 of these, though some are simply due to a user having the same name as an emote. Other than the emotes that were prevalent in general (Kappa, PogChamp), a common word among these references was "banned," referring to people who had been banned from talking on the stream by moderators. It might be interesting to look at what kinds of things moderators ban for and to try to automate some of that process. Another common word was "whisper," a feature recently added to Twitch. People are at least talking about the feature, which probably means it is getting used as well.

Profanity?

I then looked at all chat messages containing profane words to see if there were trends in how this language was directed. There were 5542 messages that contained profanity, with the most common word being variants of "fuck." The word "game" was often in posts with profanity, which isn't too strange because as mentioned earlier, a lot of people were complaining about the game choice on this day. Other words that were popular in general, such as kappa and kreygasm, were also present in posts with profanity.

The streamer had a visible injury on this day, and there were a few words related to this injury that correlated highly with profanity. These would be messages like "what the hell happened to your arm?" The streamer's name was also quite prevalent in messages that contained profanity.

A little less common were references to "mods." It seems that people get upset with moderators for banning people and possibly being too harsh. Just below that is "subs," toward whom there seems to be quite a bit of hostility. I'm not sure if this coincides with subscriber-only chat being turned on, but the use of profanity alongside "subs" is spread throughout the day's messages.

Some profanity comes in bursts, presumably as a reaction to what is happening on the stream. Terms like "sex her" appear in such bursts, showing some of the more sexist aspects of Twitch chat ("sex" was included as profanity even though it may not qualify in all cases).

Conclusions

The ubiquity of emotes on Twitch may be an interesting reason to conduct general NLP research through Twitch chat. Many of these emotes have sentiments or intentions tied to them, and for the most part, people use them for the "right" purpose. For example, Kappa is indicative of sarcasm or someone who is trolling. Sarcasm is notoriously hard for NLP to detect so having a hint like the Kappa emote could reveal general trends in sarcasm [2]. This would be a cool application of machine learning to NLP (maybe a future blog post?).

From a more practical point of view, information like this could be useful to streamers to figure out how they are doing. For example, if a streamer is trying some techniques to get chat more involved, it may be interesting to see if they are successful and they manage to increase the number of chatters with many posts. One thing I didn't consider is how top words change from day-to-day. The game being played and other factors such as recent events may cause these to fluctuate which could be interesting. Of course, more sophisticated analyses can be conducted than looking at top words, for example, looking at the grammar of the messages and seeing what the target of profanity is.

I also only considered one streamer's chat (because I couldn't find many chat logs), and I'm sure it would be interesting to see how other streams differ. The streamer I analyzed is clearly an extremely popular one, but it may be interesting to see whether the distribution of chatter engagement is different on smaller streams. It would also be interesting to see whether the things said toward female streamers differ from those said toward male streamers.

The code I used for this post is available here.

References
1. Yin, D. et al., 2009. Detection of Harassment on Web 2.0. Proceedings of the Content Analysis in the WEB 2.0 (CAW 2.0) Workshop at WWW2009.
2. Gonzalez-Ibanez, R., Muresan, S., and Wacholder, N., 2011. Identifying Sarcasm in Twitter: A Closer Look. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2, 581-586.