In this post I'll be looking at trends in the exam responses of physics students, using the Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested in whether the things students get right correlate in the way the physics would suggest. For example, I might expect students who were able to apply Gauss's law correctly to also be able to apply Ampere's law correctly, as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique that is often applied to topic modeling for bodies of text (like the last blog post). The idea of NMF is to take a matrix with nonnegative entries, $A,$ and find matrices $W$ and $H,$ also with nonnegative entries, such that

$$A = WH.$$

Usually, $W$ and $H$ will be chosen to be low rank matrices and the equality above will be approximate. Each column of $A$ is then expressed as a nonnegative linear combination of the small number of columns (topics) of $W.$ This is natural for topic modeling as everything is nonnegative, meaning that cancellations between the topics in $W$ cannot occur.
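
One common way to make the approximation precise, offered here as a sketch rather than as the exact objective used in this post, is to fix a rank $k$ and minimize the reconstruction error subject to nonnegativity. The dimensions $m$ (rubric items) and $n$ (students), the orientation of $A,$ and the choice of the Frobenius norm are all assumptions made for illustration:

$$
\min_{W \ge 0,\; H \ge 0} \; \| A - WH \|_F^2,
\qquad
A \in \mathbb{R}_{\ge 0}^{m \times n},\quad
W \in \mathbb{R}_{\ge 0}^{m \times k},\quad
H \in \mathbb{R}_{\ge 0}^{k \times n},\quad
k \ll \min(m, n).
$$

Under this reading, column $j$ of $A$ is approximated by $\sum_{t=1}^{k} H_{tj}\, W_{:,t},$ a sum in which no coefficient is negative.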

The data

For each student and for each rubric item, Gradescope stores whether the grader selected that item for the student. Each rubric item has points associated with it, so I use the point value as the corresponding entry of the matrix on which I perform the NMF. The problem, though, is that some rubric items correspond to points being taken off from the student, which is not a positive quantity. In this case, a student who was not penalized received a positive entry equal to the size of the penalty, and a student who was penalized received a 0 entry in that position of the matrix.

There were also 0-point rubric items (we use these mostly as comments that apply to many students). I ignore these entries, but finding a way to incorporate this information could also be interesting.
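
The construction described above can be sketched as follows. This is a minimal illustration, not the code used for this post: the function name `build_matrix`, the input layout (a list of point values and a boolean selection table), and the toy data are all hypothetical, and the matrix is assumed to be laid out as rubric items by students.

```python
def build_matrix(rubric_points, selections):
    """Build a nonnegative (rubric items x students) matrix.

    rubric_points: point value of each rubric item (hypothetical input format).
    selections: selections[i][j] is True if item i was applied to student j.
    """
    rows = []
    for points, applied in zip(rubric_points, selections):
        if points == 0:
            continue  # zero-point items (comments) are ignored
        if points > 0:
            # Credit item: students who received it get its point value.
            rows.append([float(points) if a else 0.0 for a in applied])
        else:
            # Penalty item: students who avoided it get |penalty|,
            # students who were penalized get 0.
            rows.append([0.0 if a else float(-points) for a in applied])
    return rows


# Toy data (made up): 3 rubric items, 2 students.
rubric_points = [4, -2, 0]
selections = [
    [True, False],  # +4 item: student 0 earned it
    [False, True],  # -2 item: student 1 was penalized
    [True, True],   # 0-point comment: dropped
]
A = build_matrix(rubric_points, selections)
print(A)  # [[4.0, 0.0], [2.0, 0.0]]
```

Flipping the penalty items this way keeps every entry nonnegative while still rewarding the same students that the original rubric rewards.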

Once the matrix is constructed, I run NMF on it to get the topic matrix $W$ and the composition matrix $H.$ For each topic, I look at the entries of $W$ with the highest values; the corresponding rubric items are the key ideas in that topic.
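
Reading off the key ideas of a topic amounts to sorting one topic's weights and keeping the largest. Here is a small sketch of that step, assuming topics are the columns of $W$ (rubric items by topics); the helper name `top_items_per_topic`, the rubric item labels, and the numbers are invented for illustration.

```python
import numpy as np


def top_items_per_topic(W, item_names, n_top=3):
    """For each topic (column of W), return the n_top highest-weight rubric items."""
    W = np.asarray(W, dtype=float)
    topics = []
    for t in range(W.shape[1]):
        # Indices of rubric items sorted by descending weight in topic t.
        order = np.argsort(W[:, t])[::-1][:n_top]
        topics.append([(item_names[i], float(W[i, t])) for i in order])
    return topics


# Toy topic matrix: 4 rubric items x 2 topics (made-up values).
W = [[0.9, 0.0],
     [0.7, 0.1],
     [0.0, 0.8],
     [0.2, 0.6]]
item_names = ["first law", "degrees of freedom", "entropy integral", "heat exchange"]

for t, items in enumerate(top_items_per_topic(W, item_names, n_top=2)):
    print(t, items)
```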

Results

The choice of the number of topics (the rank of $W$ and $H$ above) was not obvious. Ideally it would be a small number (like 5) so it would be easy to just read off the main topics. However, this seemed to pair together some unrelated ideas by virtue of them being difficult (presumably because the better students did well on these points). Another idea was to look at the error $\| A - WH\|_2$ as a function of the number of topics and to find where it flattened out. As is apparent in the plot below, this analysis suggested that adding more topics after 20 did not help to reduce the error in the factorization.

With 20 topics, it was a pain to look through all of them to determine what each topic represented. Further, some topics were almost identical. One such example was a problem relating to finding the work in an adiabatic process. Using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics. However, one topic included being able to compute the work correctly, while the other did not. This is probably an indication that the algebra leading up to finding the work was difficult for some. I tried to balance these two problems and ultimately chose 11 topics, which seems to work reasonably well.
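
The error-versus-rank scan described above can be sketched as follows. This is an illustration under stated assumptions, not the code behind this post: the solver here is the standard multiplicative-update rule of Lee and Seung (the post does not say which solver was used), the error is measured in the Frobenius norm, the function names are hypothetical, and the matrix is synthetic.

```python
import numpy as np


def nmf(A, k, n_iter=500, seed=0, eps=1e-10):
    """Rank-k NMF via Lee-Seung multiplicative updates (Frobenius objective)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


def error_by_rank(A, ranks):
    """Reconstruction error ||A - WH|| (Frobenius norm) for each candidate rank."""
    errors = {}
    for k in ranks:
        W, H = nmf(A, k)
        errors[k] = float(np.linalg.norm(A - W @ H))
    return errors


# Synthetic stand-in for the grade matrix: nonnegative and exactly rank 3.
rng = np.random.default_rng(1)
A = rng.random((30, 3)) @ rng.random((3, 40))

errors = error_by_rank(A, ranks=[1, 2, 3, 4, 5])
for k, err in errors.items():
    print(k, round(err, 4))
```

Plotting these errors against the rank and looking for the point where the curve stops dropping is the "flattening out" criterion. As noted above, that criterion alone pointed to more topics than were comfortable to interpret.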

Some "topics" are topics simply by virtue of being worth many points. These are built around rubric items such as "completely correct" or "completely incorrect." This tends to hide the finer details in a problem (e.g. a question testing multiple topics, which is quite common in the tests we gave). These topics often had a disproportionate number of points attributed to them. Apart from this, most topics seemed to have roughly the same number of points attributed to them.

Another unexpected feature was that I got a topic that negatively correlated with one's score. This was extremely counter-intuitive, as in NMF each topic can only contribute positively to the score, so having a significant weight in a topic necessarily means having a higher score. The reason this topic exists is that it captures rubric items that almost everyone gets right. A higher scoring student will get the points for these rubric items from other topics that also contain them, and so needs less weight in this topic. Most of the other topics had high contributions from rubric items that fewer than 75% of students obtained.
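
A negative relationship like the one described above can be checked directly by correlating each topic's weights with the total exam score. The sketch below assumes $H$ is laid out as topics by students; the function name `topic_score_correlation` and all of the numbers are made up for illustration.

```python
import numpy as np


def topic_score_correlation(H, total_scores):
    """Pearson correlation between each topic's weights (rows of H) and total score."""
    H = np.asarray(H, dtype=float)
    scores = np.asarray(total_scores, dtype=float)
    return [float(np.corrcoef(row, scores)[0, 1]) for row in H]


# Made-up composition matrix (2 topics x 5 students) and made-up exam totals.
H = [[0.1, 0.3, 0.5, 0.7, 0.9],  # weight grows with score
     [0.8, 0.7, 0.5, 0.4, 0.2]]  # weight shrinks with score
total_scores = [40, 55, 70, 85, 100]

corr = topic_score_correlation(H, total_scores)
print([round(c, 2) for c in corr])
```

A topic whose correlation comes out negative is a candidate for the "everyone gets these points" topic, though the rubric items behind it still need to be inspected by hand.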

Many topics were contained within a problem, but related concepts across problems did cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. However, it was more common for topics to be entirely contained in a problem.

The exam I analyzed was interesting as we gave the same exam to two groups of students, but had different graders grade the exams (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, but many topics seemed to cluster rubric items slightly differently. Still, the general topics seemed to be quite consistent between the two exams.

The plots show a student's aptitude in a topic as a function of their total exam score for three different topics. Clearly, depending on the topic the behaviors can look quite different.

Looking at topics by the student's overall score reveals some interesting trends, as shown above. As I mentioned before, there are a small number (1 or 2) of topics which students with lower scores will "master," but these are just the topics that nearly all of the students get points for. A little over half the topics are ones at which students who do well excel, but with which a significant fraction of lower scoring students have trouble. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may be indicative of misconceptions that students have in approaching these problems. My guess here would be that students did not evaluate an integral to determine the entropy change, but tried to determine it in some other way.

The rest of the topics (2-4) were topics where the distribution of points was relatively unrelated to the total score on the exam. In the example shown above, the topic was calculating the work in isothermal processes (and determining the right signs), which is a somewhat involved topic. This seems to indicate that success in this topic is unrelated to understanding the overall material. It is hard to know exactly, but my guess is that these topics test students' ability to do algebra more than their understanding of the material.

I made an attempt to assign a name to each of the topics that were found by analyzing a midterm (ignoring the topic that negatively correlated with score). The result was the following: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every rubric item in a topic fits nicely under that topic's name. In particular, the rubric items that applied to many students (>90%) would often not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The following spider charts show several students' abilities in each of the concepts. I assumed each topic in a concept contributed equally to the concept.

Aptitude in the four main concepts for an excellent student (left), an average student (middle), and a below average student (right).
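
The equal-contribution assumption behind the spider charts can be sketched as a plain average of topic weights within each concept. The topic and concept names below come from the lists above, but the assignment of topics to concepts, the student's weights, and the function name `concept_scores` are illustrative guesses rather than the values used for the charts.

```python
def concept_scores(topic_weights, concept_of_topic):
    """Average one student's topic weights within each concept (equal weighting)."""
    grouped = {}
    for topic, weight in topic_weights.items():
        grouped.setdefault(concept_of_topic[topic], []).append(weight)
    return {concept: sum(ws) / len(ws) for concept, ws in grouped.items()}


# Hypothetical topic-to-concept assignment (a guess, not taken from the post).
concept_of_topic = {
    "heat in cyclic processes": "thermodynamic processes",
    "adiabatic processes": "thermodynamic processes",
    "entropy in a reversible system": "entropy",
    "entropy in an irreversible system": "entropy",
    "particle kinetics": "particle kinetics and equations of state",
    "thermodynamic conservation laws": "conservation laws",
}

# Made-up topic weights for a single student.
topic_weights = {
    "heat in cyclic processes": 0.5,
    "adiabatic processes": 1.0,
    "entropy in a reversible system": 0.25,
    "entropy in an irreversible system": 0.75,
    "particle kinetics": 0.5,
    "thermodynamic conservation laws": 1.0,
}

scores = concept_scores(topic_weights, concept_of_topic)
print(scores)
```

Each resulting value would be one axis of a student's spider chart. Weighting topics by their point totals instead of equally is an obvious alternative that this sketch does not explore.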

Conclusions

Since the data is structured to be positive and negative (points can be given or taken off), there may be other matrix decompositions that deal with the data better. In principle, this same analysis could be done using not the matrix of points, but the matrix of boolean (1/0) indicators of rubric items. This would also allow us to take into account the zero-point rubric items that were ignored in the analysis. I do not know how this would change the observed results.

I had to manually look through the descriptions of the rubric items that applied to each topic and determine what topic was being represented. An exciting (though challenging) prospect would be to automate this process. This is tricky, though, since recognizing associations, such as that $S$ and entropy refer to the same thing, is hard to do automatically. There may also be insights from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.