Thoughts on Teaching

While I was getting my Ph.D., I taught almost every semester. I was a head TA for all but one semester, which meant that I managed a team of TAs and worked closely with the course instructors but didn't do any classroom teaching. I taught introductory electricity and magnetism the entire time, and as I watched various professors and students go through the material, I noticed that our course really wasn't optimized for how well students learned the material. This surprised me, as I was at an institution that is consistently rated highly by university rating agencies.

I'll borrow a concept from Wait But Why about how remarkable modern advancements are. For example, my parents used computers the size of a room with the computational capacity of a device that can now be worn around the wrist. However, the way my parents were taught isn't too different from modern teaching. For the most part, a lecture is still a lecturer standing at the front of the room writing on a blackboard. The lecturer may use a PowerPoint presentation, and the students may be distracted by their laptops and phones, but the fundamental way of teaching hasn't really changed.

There are, of course, efforts to modernize learning through websites like Coursera and Khan Academy. I think the real benefit of these is real-time feedback to the instructor (in this case, the algorithms behind the course). In the class I taught, the instructors' only individual feedback came from the questions students asked (and not every student asks them), as they never saw students' homework or test scores except in aggregate. However, taking courses through these online services is not recognized as sufficient mastery when compared to a university course, and I doubt that this will change anytime soon. So, I thought it would be interesting to consider what keeps universities from making their teaching more effective.

Universities are not incentivized to teach well
When I started teaching, I thought that the metric I should optimize for was students' understanding of the course material. The problem is that this is not easy to measure. Most grading at the university level seems to be done on a curve (which, as I've written before, I have concerns about) to sidestep this issue. Even grading on a curve reasonably is not trivial, as writing good exams is an art; I think only one instructor I worked with had truly mastered it. Poorly written, or even poorly administered, exams provide a weak signal of student understanding. And since instructors write different exams, it's another challenge to combine signals from separate courses taught by different instructors to get an aggregate view of how students learn.

This idea is discussed in Cathy O'Neil's Weapons of Math Destruction. Because educational quality is ill-defined, rating agencies like US News use proxies that seem to correlate with it to rank universities. This sets up a perverse incentive structure: the university is offering the service of teaching (though some may argue the service it really provides is the degree), but it optimizes around the metrics the rating agency chooses instead of actually making the learning experience great. Companies decide to hire from the perceived best schools, which means that if a high school student wants to set themselves up for a great career, it's in their best interest to apply to these top universities even if it means they will not get the best education.

Looking at US News' methodology, it seems that only 20-30% of the score that goes into ranking colleges comes directly from how teaching is conducted, through factors like class size and funds allocated to teaching students. An almost equal proportion (22.5%) of the score is reputation, which is hard for an outsider who has not taken courses at the university to assess. Part of this reputation could be the university's willingness to try techniques such as more active learning in classes, but this is handled in an ad hoc way, and I would like to see it become a more direct part of the metric.

Professors are not rewarded for teaching well
I will preface this section by saying that I worked with some amazing instructors as a head TA, and I don't want to belittle the great work that they do. The fact is, though, that there are also instructors who are not so good but are still teaching students, and they are the topic of the rest of this post.

I found that as I worked to optimize my students' understanding of the course material, not all of the professors shared the same goal. Some professors (but definitely not all) were optimizing for student happiness, under the constraint of having to rank students by performance. This makes sense from the instructor's point of view, as the metrics the department sees for an instructor are the grade distribution they assign and the feedback they receive from students (and maybe a peer review by other faculty members). I've always found it shocking that the department did not care what the course TAs thought of the professors, though admittedly this could be a noisy signal, as some classes only have one TA.

Some of the instructors who I felt were least effective at getting their students to understand the material got the highest ratings from students. In fact, it's been shown that student surveys are another poor proxy for teaching effectiveness.

Tenure also makes it hard to get rid of a professor, and I found that the tenured professors were often some of the worst instructors at optimizing student understanding. While being a Nobel-caliber physicist is an excellent qualification for getting tenure, it is neither a necessary nor a sufficient condition for being a great lecturer¹. On the flip side, some of the best instructors I worked with were lecturers, who are hired on limited-term contracts and paid little compared to their tenured colleagues.

Professors are averse to change
As a head TA, I worked with an instructor who was technologically illiterate and could not do much more on his computer than respond to email. There was another professor whose methods I had pedagogical concerns about; when I raised them with him, he retorted with something along the lines of "no, we're not changing it, because this is the way I've done it for 20 years and the students like it this way." I worked with others who were so oversubscribed with research and family obligations that they couldn't be bothered to learn a new tool for the course they were teaching. This isn't all of the instructors I worked with, but the instructors described here are still teaching and will continue to teach unless broad changes are made.

Where I do see some willingness to adopt new tools is when work on the professor's side is being automated. In our course, we used Mastering Physics to automate the grading of homework assignments and reduce the grading burden on TAs. One concern I have with Mastering Physics is that it locks universities into multi-year contracts, creating friction against moving to a new system. Combined with instructors' reluctance to learn a new system, this keeps Mastering Physics entrenched, with little pressure on it to improve its services. I've found that this has led to the course continuing to use a suboptimal system without much hope for improvement.

I've noticed some quick wins for Mastering Physics that were not implemented in the time I used the service. For example, I noticed that 75% of the wrong answers my students submitted had incorrect units. It would be relatively easy for Mastering Physics to check for this and give the students more useful feedback so that they can learn from their mistakes more easily.
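To make the idea concrete, here is a sketch of what such a dimensional check could look like. I have no knowledge of Mastering Physics' internals; the unit table and function names below are entirely illustrative, with units represented as exponent tuples over four SI base dimensions.

```python
# Hypothetical units checker: each unit maps to exponents over the SI base
# dimensions (m, kg, s, A). Compound units must reduce to the same exponents.
UNIT_DIMENSIONS = {
    "m":  (1, 0, 0, 0),    # length
    "kg": (0, 1, 0, 0),    # mass
    "s":  (0, 0, 1, 0),    # time
    "A":  (0, 0, 0, 1),    # current
    "N":  (1, 1, -2, 0),   # newton = kg*m/s^2
    "J":  (2, 1, -2, 0),   # joule = N*m
    "V":  (2, 1, -3, -1),  # volt = J/(A*s)
}

def dimensions(unit_expr):
    """Reduce a simple product of units like 'N*m' to base-dimension exponents."""
    total = [0, 0, 0, 0]
    for factor in unit_expr.split("*"):
        for i, exponent in enumerate(UNIT_DIMENSIONS[factor]):
            total[i] += exponent
    return tuple(total)

def units_feedback(submitted_unit, expected_unit):
    """Return a hint if the submitted units are dimensionally wrong, else None."""
    if dimensions(submitted_unit) != dimensions(expected_unit):
        return "Check your units: the answer should have units of %s." % expected_unit
    return None
```

For instance, `units_feedback("N", "J")` would produce a hint, while `units_feedback("N*m", "J")` returns None, since a newton-meter is dimensionally a joule.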

Apparently, when Mastering Physics was first introduced into introductory physics courses, the professors found, using some standardized physics understanding tests, that understanding of physics actually increased after adopting it. The professors attributed this to the instant feedback students got on whether their answers were correct. This sounds great for Mastering Physics, but the problem is that the solutions to the problems are now easily accessible, so it is easy to get credit for homework problems without actually solving them.

I have no clue what fraction of students actually did the Mastering Physics problems themselves. Sometimes we put problems very similar (or almost identical) to homework problems on the exams, and I was quite surprised how many students struggled with them. Even writing new problems is not a solution, as there are services like Chegg where others work out the problems for the student. In theory, this was possible pre-internet, but the scale of the internet makes it that much easier for students to exploit.

Even beyond that, Mastering Physics and other products like WebAssign are limited in scope. They're rarely used outside of introductory classes because they can really only check the final answer. I think there is more that can be automated in grading assignments. A lot of hard-science and math questions boil down to choosing the correct starting point and showing a logical progression of steps to the final result, which is entirely within the realm of tasks a computer can do. Even for disciplines where answers are short free-text responses, advancements in NLP are probably at the point where a solution could be compared with sample responses with quite high accuracy. We would still need a human to check that the model wasn't doing anything crazy, but this would considerably speed up the grading process.
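Even a crude version of this comparison is easy to sketch. The toy below matches a short answer against sample responses with a bag-of-words cosine similarity; a real system would use modern NLP models, and the 0.5 threshold here is an arbitrary illustrative choice, not a tested calibration.

```python
# Toy short-answer matcher: bag-of-words cosine similarity against samples.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two strings' word-count vectors."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm = (math.sqrt(sum(v * v for v in counts_a.values()))
            * math.sqrt(sum(v * v for v in counts_b.values())))
    return dot / norm if norm else 0.0

def matches_sample(student_answer, sample_answers, threshold=0.5):
    """Flag an answer as likely correct if it is close to any sample response."""
    return max(cosine(student_answer, s) for s in sample_answers) >= threshold
```

A human grader would still review borderline cases, but this shows how a machine could triage the bulk of near-verbatim correct answers.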

The Future
Getting rating agencies, universities, and professors to care about teaching effectiveness is not an easy task. To me, the fundamental issue is that teaching effectiveness cannot be measured easily, so poor proxies are used in its place. If rating agencies could rate universities on teaching effectiveness, this would pressure universities to care more, which would in turn pressure instructors to care more.

I've mentioned before how analyzing Gradescope data could help instructors get a better idea of what each individual student has actually learned. As mentioned earlier, this is quite valuable, as the lecturers for the course I taught never graded a homework assignment or a test. I can foresee instructors one day being given a dashboard where they can see what individual students understand, and segment by factors such as attendance rate, to get near real-time feedback on how they are performing.

But I also think Gradescope may be able to tackle the teaching effectiveness problem itself. By offering faster, less biased grading, the service is likely to be adopted; I've seen that instructors respond well when tedious parts of their work are automated. As adoption increases, Gradescope accumulates a unique data set spanning many courses, disciplines, and universities. Making sense of all that data is a challenge, but if I had to guess, I would say it holds insights about teaching effectiveness that could drive some of the changes education needs.


1. This may not be true in every discipline. I have taken introductory anthropology classes in which the instructor's research was a relevant topic of discussion.

Grade Inflation

I have been thinking a lot about teaching lately (maybe because I will not be teaching anymore) and I hope to write a short series of blog posts about it. This first post is on grade inflation, specifically whether curving is an effective way to combat it.

A popular method of combating grade inflation seems to be to impose a set curve on all classes: for example, the top 25% of students get As, the next 35% get Bs, and the bottom 40% get Cs, Ds, and Fs (which is the guideline for my class). While this avoids the problem of too many people getting As, it can be a bit too rigid, as I will show below.

In the class I teach, there are ~350 students spread among three lectures. I will investigate what effect the splitting of students into lectures has on their grades. First, I make an incredibly simple model in which I assume there is a "true" ranking of all the students: if all the students were actually in one big class, this would be the ordering of their grades. I also assume that the assessments in the classes they end up in are completely fair, so that the student whose "true" ranking is highest in a class gets the highest grade in that class, the student ranked second highest gets the second highest grade, and so on. I then assign students randomly to three classes and see how much their percentile within the class fluctuates based on these random choices. This is shown below.


The straight black line shows the percentile a student would have had in one large lecture. The black curves above and below it bound the 90% variability in percentile due to random assignment.
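The simulation behind this figure can be sketched in a few lines. The sketch below follows the model described above (350 students, three lectures, grades within a class following the "true" ranking); the number of trials, the seed, and the choice of the median student are arbitrary illustrative parameters, not the exact ones used for the figure.

```python
# Monte Carlo sketch: how much does random class assignment move one
# student's within-class percentile?
import random

def percentile_spread(n_students=350, n_classes=3, trials=1000, student=175, seed=0):
    """Estimate the 5th and 95th percentiles of one student's within-class
    percentile over many random assignments of students to classes."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        order = list(range(n_students))  # index = "true" rank; 0 is best
        rng.shuffle(order)
        for i in range(n_classes):
            cls = order[i::n_classes]
            if student in cls:
                # Within a class, grades follow the "true" ranking exactly.
                within_rank = sorted(cls).index(student)
                results.append(100 * (1 - within_rank / len(cls)))
                break
    results.sort()
    return results[int(0.05 * trials)], results[int(0.95 * trials)]
```

For a student in the middle of the pack, the 90% interval this returns spans a nontrivial range of percentiles, which is exactly the variability the figure shows.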

We see that even random assignment can cause significant fluctuations, particularly for students in the "middle of the pack." Most students, apart from those at the very top and bottom, could have their letter grade change by a third of a letter just due to how the classes were assigned.

Further, this assumed the assignment was random. Often, the 8 am lecture has more freshmen, because they register last and lectures at better times fill up first. There may also be a class that advanced students take that conflicts with one of my course's lecture times, pushing those students into the other lectures. These effects would only make the story worse than what's shown above.

We have also assumed that each class can perfectly rank its students from best to worst. Unfortunately, there is variability in how exam problems are graded and in how well questions distinguish students, so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything. Since the grading is based on ranking students, even if a lecturer is ineffective and the students in the course don't learn very much, the students' scores will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but any top university arguably should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). The deeper issue is that how much of the course material students learned is extremely hard to quantify without imposing standards, and standardized testing leads to ineffective teaching methods (teaching "to the test"), so it is clearly not the answer. I'm not aware of a good way to solve this problem, but I think data-driven approaches to studying it would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages on each class, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, the upper and lower curves would ideally be much closer to the grade in the "true ranking" scenario, so that luck and scheduling conflicts have much less of an effect on a student's grade. This would mean that sometimes a class gets 40% As and sometimes 15% As, but that would be okay, because those are the grades the students should get. The question, then, is how to accomplish this.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean having three different test problems on each topic and randomly assigning each student one of the three. Apart from the logistical nightmare this would bring about, it would really only work when one lecturer teaches all the classes. If one of the lecturers is much better than another, or likes to do problems close to the test problems in lecture, then their students will perform better relative to students in the other lectures because of their lecturer. To make this work, there needs to be a way to "factor out" the effect of the lecturer.
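The assignment step of this scheme is simple to sketch. Everything below is illustrative (the function name, the topic labels, the seeding choice); seeding the draw makes the assignment reproducible, which matters if exams need to be reprinted or audited.

```python
# Sketch of randomized problem-version assignment: each student draws one
# of n_versions variants of the problem for every topic on the exam.
import random

def assign_versions(student_ids, topics, n_versions=3, seed=42):
    """Map each student to a problem version (0..n_versions-1) per topic."""
    rng = random.Random(seed)
    return {
        sid: {topic: rng.randrange(n_versions) for topic in topics}
        for sid in student_ids
    }
```

With a fixed seed, running the assignment twice gives identical results, so the exam versions each student receives are stable across reprints.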

Another method would be to treat grading more like high school and set rigid grade distributions. The tests would then have to be written so that we'd expect their outcome to follow the guideline grade distributions set by the university, assuming the students in the class are representative of the general student population. Notably, the test is not written so that this particular course follows the guideline distribution. Of course, this is more work than simply writing a test, and the outcome of a test is certainly hard to predict. I've often given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing that the instructor did an extremely similar problem in class.

One way to implement this would be to look at similar problems on past tests and see how students did on them. (Coincidentally, this wasn't possible until recently, when we started using Gradescope.) This gives an idea of how we would expect students to perform, and we can use this data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they'll have seen while practicing past exams, and so it is hard to define how similar a problem should be. Doing this right requires quite a bit of past test data, which, as I mentioned, isn't available. Similar problems given by other professors may help, but then we run into the problem above: different lecturers teach, and thus standardize, the course differently.
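One plausible form of the weighting step is sketched below. It is an assumption of mine, not the post's method: the similarity weights would have to come from some separate notion of problem similarity, which, as noted above, is itself hard to define.

```python
# Hypothetical sketch: estimate a new problem's difficulty from how students
# scored on similar past problems, then rescale its point value so its
# expected contribution matches a target fraction.

def expected_fraction(similar_past_problems):
    """similar_past_problems: list of (similarity_weight, mean_score_fraction).
    Returns the similarity-weighted mean score fraction."""
    total = sum(w for w, _ in similar_past_problems)
    return sum(w * f for w, f in similar_past_problems) / total

def scaled_points(nominal_points, similar_past_problems, target_fraction=0.7):
    """Scale a problem's points so its expected score hits the target fraction."""
    return nominal_points * target_fraction / expected_fraction(similar_past_problems)
```

A problem that past data suggests is harder than the target gets weighted up, and an easier one gets weighted down, nudging the expected distribution toward the guideline without curving after the fact.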

Without experimenting with different solutions, it's impossible to figure out which is best, but it seems crazy to accept that curving classes is the best we can do. With some work, there could be solutions that encourage student collaboration, reward students for doing their homework instead of penalizing them for not doing it (I hope to write more on this in the future), and take into account how much students are actually learning.

Code for the figure is available here.

Topic Modeling and Gradescope

In this post, I'll be looking at trends in the exam responses of physics students, using Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested in whether the things students get right correlate in the way physics intuition would suggest. For example, I might expect students who can apply Gauss's law correctly to also apply Ampere's law correctly, as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique often applied to topic modeling of bodies of text (like the last blog post). The idea of NMF is to take a matrix with nonnegative entries, A, and find matrices W and H, also with nonnegative entries, such that

 A = WH.

Usually, W and H are chosen to be low-rank matrices, and the equality above is approximate. Each vector in A is then expressed as a nonnegative linear combination of a small number of topic vectors. This is natural for topic modeling: since everything is nonnegative, cancellations between topics cannot occur.
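For readers who want to see the factorization itself, here is a minimal pure-Python sketch of the classic multiplicative-update algorithm for NMF (Lee & Seung). In practice one would use a library implementation such as sklearn.decomposition.NMF; this version exists only to show the mechanics, and the iteration count and initialization are arbitrary.

```python
# Minimal NMF via multiplicative updates: A (n x m) ~ W (n x rank) @ H (rank x m),
# all entries kept nonnegative throughout.
import random

def nmf(A, rank, iters=2000, seed=0):
    """Factor a nonnegative matrix A (list of lists) as A ~ W @ H."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(rank)]

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    def transpose(X):
        return [list(col) for col in zip(*X)]

    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H)
        WtA = matmul(transpose(W), A)
        WtWH = matmul(transpose(W), matmul(W, H))
        H = [[H[i][j] * WtA[i][j] / (WtWH[i][j] + eps)
              for j in range(m)] for i in range(rank)]
        # W <- W * (A H^T) / (W H H^T)
        AHt = matmul(A, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * AHt[i][j] / (WHHt[i][j] + eps)
              for j in range(rank)] for i in range(n)]
    return W, H
```

Because the updates multiply by ratios of nonnegative quantities, W and H stay nonnegative at every step, which is the property that makes the topics interpretable.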

The data

For each student and each rubric item, Gradescope stores whether the grader selected that item for the student. Each rubric item has points associated with it, so I use these as the weights in the matrix I perform NMF on. The problem, though, is that some rubric items correspond to points being taken off, which is not a nonnegative quantity. In these cases, I credited the absence of a penalty with the magnitude of the penalty, and students who were penalized got a 0 in that entry of the matrix.

There were also 0-point rubric items (we use these mostly as comments that apply to many students). I ignore these entries, though finding a way to incorporate this information could also be interesting.
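The matrix construction described above can be sketched directly. The data-structure shapes below are my own illustration, not Gradescope's actual export format: positive items contribute their points when selected, deductions contribute the magnitude of the penalty when *not* applied, and zero-point items are dropped.

```python
# Build the nonnegative student-by-rubric-item matrix for NMF.

def build_matrix(students, rubric_items, selected):
    """
    students: list of student ids
    rubric_items: list of (item_id, points) pairs; negative points = deduction
    selected: set of (student_id, item_id) pairs the grader marked
    Returns one row of nonnegative entries per student.
    """
    matrix = []
    for sid in students:
        row = []
        for item_id, points in rubric_items:
            if points == 0:
                continue  # zero-point comment items are ignored
            if points > 0:
                row.append(points if (sid, item_id) in selected else 0)
            else:
                # Deduction: credit the magnitude of the penalty when absent.
                row.append(0 if (sid, item_id) in selected else -points)
        matrix.append(row)
    return matrix
```

For example, with a 3-point item, a 2-point deduction, and a 0-point comment, a student who earned the item and avoided the deduction gets the row [3, 2], while one who missed the item and was penalized gets [0, 0].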

Once the matrix is constructed, I run NMF on it to get the topic matrix W and the composition matrix H. The entries in W with the highest values are the key ideas in each topic.


The choice of the number of topics (the rank of W and H above) was not obvious. Ideally, it would be a small number (like 5) so the main topics could simply be read off. However, this seemed to pair together some unrelated ideas by virtue of their being difficult (presumably because the better students did well on all of them). Another approach was to look at the reconstruction error ||A − WH||₂ and determine where it flattened out. As shown below, this analysis suggested that adding more topics after 20 did not help reduce the error in the factorization.

With 20 topics, it was a pain to look through all of them to determine what each represented, and some topics were almost identical. One example involved a problem on finding the work in an adiabatic process: using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics, but one topic included computing the work correctly while the other did not. This is probably an indication that the algebra leading up to the work was difficult for some students. Balancing these issues, I ultimately chose 11 topics, which seemed to work reasonably well.

Some "topics" are topics simply by virtue of being worth many points. These would be rubric items with entries such as "completely correct" or "completely incorrect." They tend to hide the finer details in a problem (e.g., a question testing multiple topics, which is quite common in the tests we gave), and these topics often had a disproportionate number of points attributed to them. Apart from this, most topics had roughly the same number of points attributed to them.

Another unexpected feature was a topic that negatively correlated with students' scores. This was extremely counterintuitive, as in NMF each topic can only contribute positively to a score, so having a significant component in a topic necessarily means having a higher score. The reason this component exists is that it captures rubric items that almost everyone gets right. A higher-scoring student gets the points for these rubric items through other topics that also contain them. Most of the other topics had high contributions from rubric items that fewer than 75% of students obtained.

Many topics were contained within a single problem, though related concepts across problems did sometimes cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. Still, it was more common for a topic to be entirely contained in one problem.

The exam I analyzed was interesting in that we gave the same exam to two groups of students but had different graders grade them (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, while others clustered rubric items slightly differently. Still, the general topics were quite consistent between the two exams.


The plots show students' aptitude in a topic as a function of their total exam score, for three different topics. Clearly, the behavior can look quite different depending on the topic.

Looking at topics against students' overall scores shows some interesting trends, as above. As I mentioned, there are a small number (1 or 2) of topics that even students with lower scores "master," but these are just the topics that nearly all students get points for. A little over half the topics are ones that students who do well excel at but a significant fraction of lower-scoring students struggle with. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may indicate misconceptions students have in approaching these problems; my guess is that these students did not evaluate an integral to determine the entropy change but tried to determine it some other way.

The remaining topics (2-4 of them) were ones where the distribution of points was relatively unrelated to the total exam score. In the example shown above, the topic was calculating work (and getting its signs right) in isothermal processes, which is a somewhat involved calculation. This seems to indicate that success in these topics is unrelated to understanding of the overall material. It is hard to know exactly, but my guess is that they test students' ability to do algebra more than their understanding of the physics.

I made an attempt to assign a name to each topic found by analyzing one midterm (ignoring the topic that negatively correlated with score). The result was: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every rubric item fit nicely within its topic; in particular, the rubric items that applied to many students (>90%) often did not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The following spider charts show several students' abilities in each of the concepts, where I assumed each topic in a concept contributes to it equally.


Aptitude in the four main concepts for an excellent student (left), an average student (middle), and a below-average student (right).
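The concept scores behind spider charts like these reduce to a plain average of topic aptitudes. The grouping of topics into concepts below is my own assumed mapping of the topic names listed earlier onto the four concepts; the post does not spell out the exact assignment.

```python
# Aggregate per-topic aptitudes into concept scores by equal-weight averaging.
# The topic-to-concept grouping here is an assumption, not the post's exact one.
CONCEPTS = {
    "thermodynamic processes": [
        "heat in cyclic processes", "work in cyclic processes", "adiabatic processes",
    ],
    "particle kinetics and equations of state": [
        "particle kinetics", "particle kinetics and equations of state",
    ],
    "entropy": [
        "entropy in a reversible system", "entropy in an irreversible system",
    ],
    "conservation laws": ["thermodynamic conservation laws"],
}

def concept_scores(topic_scores):
    """topic_scores: dict mapping topic name -> aptitude in [0, 1].
    Returns the equal-weight mean aptitude for each concept."""
    return {
        concept: sum(topic_scores[t] for t in topics) / len(topics)
        for concept, topics in CONCEPTS.items()
    }
```

Each chart axis is then just one entry of this dictionary for one student.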


Since the data has both positive and negative values (points can be given or taken off), there may be other matrix decompositions that handle it better. In principle, this same analysis could be done using not the matrix of points but the matrix of boolean (1/0) indicators of rubric items. This would also allow us to take into account the zero-point rubric items that were ignored in this analysis. I do not know how this would change the observed results.

I had to manually look through the descriptions of the rubric items in each topic to determine what the topic represented. An exciting (though challenging) prospect would be to automate this process. It is tricky, though, as it requires making associations such as recognizing that "S" and "entropy" refer to the same thing. There may also be insights from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.