# Optimization Functions in Deep Learning

This will be a short post about optimization functions in deep learning. For simplicity, I'll show how these optimization functions work with a simple 1D example. Suppose that the loss function, $L,$ as a function of weight, $w,$ looks like the following.

The flat region for values of the weight between 1 and 2 is a little contrived, but this will help depict some of the advantages of the more complicated optimization functions. Saddle points are ubiquitous in a deep learning training landscape, so these are valid concerns to consider.

With the standard gradient descent described in a previous post, the optimization needs to entirely avoid the flat region. In this region, the gradient is zero, and thus no update to the weights can be made. As shown below, for positive starting values of the weight, it is hard to avoid the zero gradient region. Even a negative starting value can end up in the flat region if the learning rate is large enough.

Loss as a function of optimization steps starting with a weight of -10 (top) and a weight of 10 (bottom) for different learning rates. Note that the loss flattening out at $10^0$ is an artifact of the flat region of the loss function.

Decaying the learning rate as the optimization time increases also doesn't really help here. The biggest problem is the algorithm will get stuck in the flat region, and changing the learning rate won't fix this issue.

Momentum

This example shows that the crux of the problem is that the update goes straight to zero when a flat region is encountered. The idea behind momentum is to track a momentum variable that updates with each step.

In these expressions, a subscript denotes the vector component to allow the expressions to generalize to multiple dimensions. There is now a new parameter, $\alpha,$ that tracks how much memory of previous gradients the optimization process should have. Note that when $\alpha = 0,$ this reduces to standard gradient descent.

Implementing this on the function above, we see now that as long as the momentum parameter is large enough, the flat region can be avoided, even with the same learning rate as before. On top of that, the convergence is far faster than any of the choices above.

Loss as a function of optimization steps starting with a weight of 10 for different momentum parameters.

RMSprop

Typical deep learning models have many orders of magnitude more parameters (100000+). Suppose that during the training process, the gradients start out large for some of these variables but not others. There will be a significant update for those variables with large gradients. The gradients of the other parameters might become larger as the training proceeds. However, if the learning rate is decaying through the training process, these updates will not change the parameters as much as if these large gradients had appeared earlier in the training process.

Thus, RMSprop attempts to tune the learning rate for each parameter. If the gradient over the previous steps has been large for that parameter, the learning rate for that parameter will be smaller. This allows for smaller steps to refine the value of the parameter, as discussed in a previous post. Conversely, if the gradient over the last few steps has been small, we may want a larger learning rate to be able to move more in that parameter. We don't want the memory of the gradient sizes to be too long since it seems perverse for the first gradients to affect the 100000th step of training. Thus, a moving average is used to keep track of the gradient magnitudes. The updates of RMSprop are the following.

We see that $\sqrt{r_i}$ is a moving average of the gradient magnitude for each parameter. The moving average is characterized by the parameter $\rho.$ $\delta$ is a parameter chosen such that denominator in the fraction does not diverge. $\eta,$ as before is the learning rate of the problem. Global learning rate decay can also be combined with this method to ensure learning rates for each parameter are bounded.

Since the update is always directly proportional to the gradient, for the problem at hand, RMSprop doesn't help and will get stuck much like traditional gradient descent.

If momentum works well, and RMSprop works well, then why not combine both? This is the idea behind Adam. The gradient descent updates of Adam are the following.

The two moving averages, corresponding to the momentum and gradient magnitude, are tracked and used for the update. Adam is my go-to optimizer when training deep learning models. Typically the only parameter that needs tuning is the learning rate, $\eta,$ and the default values of the other parameters ($\delta=10^{-8},\alpha=0.9,\rho=0.999$) tend to work really well. I applied Adam to the 1D example loss function with these default parameters and got some decent results. This looks a little worse than momentum, but this is because this problem is so simple. For most real-world problems Adam outperforms momentum.

Loss as a function of optimization steps using Adam optimizer with default parameters starting with a weight of 10 for different learning rates.

# Thoughts on Teaching

While I was getting my Ph.D., I taught almost every semester. I was a head TA for all but one semester which meant that I managed a team of TAs and worked closely with the course instructors, but didn't do any classroom teaching. I taught introductory electricity and magnetism the entire time, and as I saw various professors and various students go through the material, I noticed that our course really wasn't optimized to maximize how well students learn the course material. This was surprising to me as I was at an institution that is consistently rated highly by many university rating agencies.

I'll borrow a concept from Wait But Why of how remarkable advancements are. For example, my parents used computers the size of a room that has the computational capacity of a computer that now can be worn around the wrist. However, the way my parents were taught isn't too different from modern teaching. For the most part, lectures are still a lecturer standing at the front of the room writing on a blackboard. The lecturer may use a powerpoint presentation, and the students may be distracted from their laptops and phones, but the fundamental way of teaching hasn't really changed.

There are, of course, efforts to modernize learning through websites like Coursera and Khan Academy. I think the real benefit of these is real-time feedback to the instructor (in this case, the algorithms behind the course). In the class I taught, the instructor only got individual feedback from the questions students asked (which not everyone does), as they never saw student's homework or test scores except in aggregate. However, taking courses through these online services is not recognized as sufficient mastery when compared to a university course, and I doubt that this will change anytime soon. So, I thought it would be interesting to consider what keeps universities from making efforts to make their teaching more effective.

Universities are not incentivized to teach well
When I started teaching, I thought that the metric I should optimize for was student's understanding of the course material. The problem is that this is not easy to measure. Most grading at the university level seems to be done on a curve (which as I've written before, I have concerns about) to sidestep this issue. Even grading on a curve reasonably is not trivial as writing good exams is an art. I think only one instructor I worked with had truly mastered it. Poorly written, or even poorly administered, exams provide a weak signal of student understanding. As instructors write different exams, it's another challenge to combine signals from separate courses taught by different instructors to get an aggregate view of how students learn.

This is an idea that is talked about in Cathy O'Neil's Weapons of Math Destruction. Because educational quality is something that is ill-defined, rating agencies, like US News, use proxies that seem to correlate with educational quality to rank universities. This sets up a perverse incentive structure, as the university is offering the service of teaching (though some may argue the service they provide is really the degree) but they optimize around the metrics the rating agency chooses instead of actually making the learning experience great. Companies decide to hire from the perceived best schools, which means that if a high school student wants to set up for a great career, it's in their best interest to apply for these top universities even if it means they will not get the best education.

Looking at the US News' metric, it seems that 20-30% of the metric that goes into ranking colleges actually directly comes from how teaching is conducted, considering class size and funds allocated to teaching students. An almost equal proportion (22.5%) of the score is reputation, which would be hard to assess from an outsider who has not taken courses at the university. Part of this reputation could be the university's willingness to try techniques such as more active learning in classes, but this is handled in an ad-hoc way, and I would like to see it be a more direct part of the metric.

Professors are not rewarded for teaching well
I will preface this section by saying that I worked with some amazing instructors while being a head TA. I don't want to belittle the great work that they do. The fact is, though, that there are instructors who are not so good who are still teaching students and that is the topic of the rest of this post.

I found that as I worked to optimize my students' understanding of course material, that not all of the professors shared the same goal. I found that some professors (but definitely not all) were optimizing for student happiness, under the constraints of having to rank students by performance. This makes sense from the instructor's point of view as the metrics the department sees of the instructor is the grade distribution they assign and the feedback they receive from students (and maybe a peer review by other faculty members). I've always found it shocking that the department did not care what the course TAs thought of the professors, though admittedly this could be a noisy signal as some classes only have one TA.

I felt that some of the instructors who I felt were ineffective at getting their students to understand the material got the highest ratings by students. In fact, it's been shown that student surveys are another metric that is a poor proxy for teaching effectiveness.

In fact, tenure makes it hard to get rid of a professor, and I found that often the tenured professors were some of the worst instructors at optimizing student understanding. While being a Nobel prize eligible physicist is an excellent qualification for getting tenure, this really isn't a necessary or even sufficient condition to be a great lecturer¹. On the flip side, some lecturers were some of the best instructors I worked with, but they are hired on limited term contracts and are paid little compared to their tenured colleagues.

Professors are averse to change
As a head TA, I worked with an instructor who was technologically illiterate and could not do much more on his computer than respond to email. There was another professor whose methods I had pedagogical concerns about. When I raised these to him, he retorted with something along the lines of "no, we're not changing it because this is the way I've done it for 20 years and the students like it this way." I worked with others who are oversubscribed with research and family obligations that they can't be bothered to learn a new tool for the course they are teaching. This isn't all the instructors I worked with, but the instructors described here are still teaching and will continue to teach unless broad changes are made.

Where I do see some willingness to adopt new tools is when work is being automated on the professor's side. In our course, we used Mastering Physics to automate the grading of homework assignments and reduce the grading burden on TAs. One concern I have with Mastering Physics is that it locks universities into multi-year contracts, creating friction to move to a new system. Combined with the instructor's desires not to learn a new system, this causes Mastering Physics to stick around, and there is not much pressure for Mastering Physics to improve their services. I've found that this has led to the course continuing to use a suboptimal system without much hope for improvement.

I've noticed some quick wins for Mastering Physics that were not implemented in the time I used the service. For example, I noticed that 75% of the wrong answers my students submitted had incorrect units. It would be relatively easy for Mastering Physics to implement a check for this and give the students more useful feedback so that they learn their mistakes more easily.

Apparently, when Mastering Physics was first introduced into introductory physics courses, the professors found, using some standardized physics understanding tests, that understanding of physics actually increased after using Mastering Physics. The professors attributed this to the fact that the students got instant feedback on whether their answers were correct. This sounds great for Mastering Physics, but the problem is that the solution to the problems are now easily accessible and so it is easy to get credit for the homework problems without actually solving them.

I have no clue what fraction of students actually did the Mastering Physics problems. Sometimes we put problems very similar (or almost identical) to homework problems on the exams, and I'm quite surprised that many students will struggle to do the problems on the exam. Even creating new problems is not a solution, as there are services like Chegg that have others work out problems for the student. In theory, this was possible pre-internet, but the scale of the internet makes it that much easier for students to exploit.

Even beyond that, Mastering Physics and other products like WebAssign are limited in scope. They're rarely used outside of introductory classes because they can really only check the final answer. I do think there is more that can be automated in grading assignments. A lot of hard science and math questions boil down to choosing the correct starting point and showing a logical progression of steps to the final result. This is entirely within the realm of tasks a computer can do. Even for other disciplines where answers may be short answers, advancements in NLP are probably at the point where a solution could be compared with sample response(s) with quite a high accuracy. We would still need a human to check to make sure the model wasn't performing crazily, but this would considerably speed up the grading process.

The Future
Getting rating agencies, universities, and professors to care about teaching effectiveness is not an easy task. To me, the fundamental issue is that teaching effectiveness cannot be measured easily and poor proxies are being used to measure it. If rating agencies could rate universities based on teaching effectiveness, this would put pressure on universities to care more, which would then put pressure on instructors to care more.

I've mentioned before about how analyzing Gradescope data could be useful in helping instructors get a better idea of what each individual student has actually learned. As mentioned earlier, this is quite valuable as the lecturers for the course I taught never graded a homework assignment or a test. I could foresee in the future instructors being given a dashboard where they can look at what individual students understand while also being able to segment by factors such as attendance rate to get near real-time feedback about how they are performing.

But I also think Gradescope may be able to tackle the teaching effectiveness problem. By offering a service that allows for faster, less-biased grading, instructors are likely to adopt the service. I've seen that instructors respond well when tedious parts of their work are automated. As adoption increases, Gradescope gets a unique set of data, with different courses across many different disciplines across many different universities. Making sense of all the data is a challenge, but if I had to guess, I would say that this data has insights about teaching effectiveness that can hopefully drive some of the necessary changes to education.

1. This may not be true in every discipline. I have taken introductory anthropology classes in which the instructor's research was a relevant topic of discussion.

For some background, there is no way I would have known in college that I would not have wanted to apply for postdocs after my Ph.D. My career plan then was to go through grad school, be successful, and someday end up a professor. During my Masters' degree, I realized that this may not be the life I wanted. My advisers at the time (who were married to each other) would routinely be at the institute until 8 or 9 in the evening and would come in early in the morning. I often wondered if they discussed much else than physics at home. During my Ph.D., I also observed my advisers working long hours and working on weekends and holidays. Once my advisor even told me that anyone I (a physicist) dated should expect for me to be unavailable on weekends and holidays when there was important physics to do. Still, I was in too deep at this point, so I determined the minimal amount of work I needed to get a Ph.D. and completed that. Now that I have a job as a data scientist, the benefits of my Ph.D. seem to be the people I met and the connections I made (which did help me get the job), but the actual knowledge I gained during the degree has been mostly useless. Looking back, I'm reminded of some of the critical issues with UC Berkeley and academia and have outlined them here.

For some reason, during grad school, you are expected to volunteer your time, with no pay or credit. This is especially apparent during the summer when my contract said I was supposed to work 19 hours/week, but my adviser expected me to come in 40+ hours/week. What also shocked me is when I told fellow grad students about this issue, they were not even aware they were only supposed to be working halftime. Further, if an adviser does not have funding and the student teaches during summer to cover costs, the student has no obligation to do research during that summer, but this isn't communicated to the student. These facts are rarely spelled out. The sad thing is, this doesn't end with grad school. I know post-docs (at my institution and others) who have told me that their contracts say they should work around 40 hours/week, but are routinely actually expected to work 50-60 hours/week.

My last semester in grad school I was not enrolled in classes, not getting paid for research and was just working on my thesis. Hence, there was no obligation for me to do anything that was not for my benefit. Still, my advisers tried to guilt me into coming into the office often (with threats of not signing my thesis) and to continue doing work. It still bothers me I paid the university (for full disclosure, I made plenty of money during a summer internship to cover these costs) for the "opportunity" to work for the research group for that semester.

When research is done for course credit, the lines are blurred a bit. The time spent on the "course" is not necessarily fixed. I would think that if it is a "course," then research obligations related to the course should start when the semester starts and end when the semester ends. Certainly, the "course" should not require a student to attend a meeting on a university holiday or weekend (which happened to me during grad school). Most universities have standards and expect professors teaching courses to be available for their students. Some graduate students I know talk to their advisors a couple of times a semester which I would think hardly respects these standards. This has also led me to question what can and cannot be asked of a student in a course. For example, if there were a "course" in T-shirt making that made students work in sweat-shop like conditions to get a grade in a class, would this be legal? While not a tangible product, research for credit is a somewhat similar scenario where the students are producing papers that will ultimately benefit the professor's fame (and a small chance of advancing the student's career).

Another issue is that, as budgets get tighter, expenses get passed off to students. During my time at Berkeley, keeping pens and paper in a storeroom was deemed too expensive. The department suggested each research group make their own purchases of these items through the purchasing website. Not only is this a waste of graduate student's time, but the website was so terrible that often it was easier just to buy items and not get reimbursed for them. Most research is done on student's personal laptops, and while a necessity to continue work, rarely is their support from the university to make this purchase or pay for maintenance when it is used for research work. There is no IT staff, so again it falls on students and post-docs to waste their time dealing with network issues and computer outages rather than focusing on the work that is actually interesting.

I apparently have a roughly 50% chance of "making it" as a professor, which is mostly because I attended a good institution and published in a journal with a high impact factor. Ph.D. exit surveys have found similar rates for the fraction of students that stay in academia. Yet, the general expectation in academia is that all of the students will go on to have a research-focused career (I know some professors who look down upon a teaching-focused career as well even though this is still technically academia).

Even after making it extremely clear to my adviser that I had no intentions of pursuing a postdoc, he told me that I should think about applying. He went so far as to say data science (my chosen profession) was a fad and would probably die out in a few years. Once during his class, he seemed quite proud of the fact that between industry and academic jobs, most of his students had wound up staying in physics. While my adviser wasn't too unhappy with me taking courses unrelated to research, many advisers will actively encourage their students to focus on research and not take classes. This is terrible advice, considering useful skills in computer science and math are often crucial to getting jobs outside of academia.

The physics curriculum, in general, is flawed. At no point in a typical undergraduate/graduate curriculum are there courses on asymptotic analysis, algorithms, numerical methods, or rigorous statistics, which are all useful both inside and outside of physics. Often these are assumed known or trivial, yet this gives physicists a poor foundation and can lead to problems when working on relevant problems. I took classes on all of these topics, though they were optional, and they are proving to be more useful to me than most of the physics courses I took or even research I conducted as a graduate student.

This is a problem at the institutional level as well. For physics graduate students at UC Berkeley, the qualifying exam is set up to test students on topics of the student's choosing. I chose numerical methods and statistics as my main topic, but my committee asked me no questions on numerical methods and statistics. To be fair, one member tried, but he didn't know what to ask, but then again, he could have prepared something to ask since the topics are announced months before the exam. Instead, my committee asked me general plasma physics and quantum mechanics questions which were not topics I had chosen. I failed to answer those questions (and have even less ability to solve it now), but somehow still passed the exam. This convinced me that the exam was nothing more than a formality, yet no one was willing to make changes to make the exam more useful. An easy fix would be to frame the oral exam as interview practice, as most graduate students have no experience with interviews.

There are many student-led efforts to try to make transitioning into a non-academic job easier at UC Berkeley. But the issue is, for the most part, they are student-led and have little faculty support. I was very involved in these groups, and it was quite clear that my adviser did not want me to be involved. Considering that involvement is why I have a job right now, I would say I made the right choice.

As I alluded to before, even though I was receiving no money or course credit from my research group in my last semester, my advisers threatened to not sign my thesis unless I completed various research tasks (some unrelated to the actual thesis). I've talked to others that have had similar experiences, and I hate to say I'm confident that this extends beyond research tasks in some research groups. This could be solved by making the thesis review process anonymous and have the advisers take no part in it, but there seem to be no efforts to make this happen.

Another fault is that advisers can get rid of students on a whim. I've known people who have sunk three years into a research group only to be told that they cannot continue. Then, the student has to make a decision to start from zero and spend a ridiculous amount of their life in grad school or leave without a degree rendering the three years in the research group irrelevant. Because of this power, professors can make their students work long hours and come in on weekends and holidays. If the student does not oblige, the time the student has already put into the research group is just wasted time.

It doesn't help that research advisers are usually also respected members of their research area. That is, if a student decides to stay in academia (particularly in the same research area as grad school), the word of their adviser could make or break their career. This again gives an opportunity for advisers to request favors from students.

Further, the university has every motive to protect professors, especially those with tenure, but not graduate students. This became embarrassingly clear, for example, in how the university handles Geoff Marcy before Buzzfeed got a hold of the news. If professors can ignore rules set by the University (and sometimes laws) with little repercussion, there is little faith graduate students can have that new rules the university creates to any of the problems mentioned here will be followed. I don't want to belittle the (mostly student-led) efforts to make sexual harassment less of an issue at UC Berkeley, but ultimately the only change I can pinpoint is that there is now more sexual harassment training, which has been shown by research at UC Berkeley to lead to more sexual harassment incidents.

Ultimately, graduate school and a career in academia work for some people. Some people are passionate about science and love working on their problems, even if that means making a few sacrifices. Progression of science is a noble, necessary goal, and I am glad there are people out there to make it happen. My hope is that many of the problems mentioned here can be rectified so that the experience for those people and also the people who realize that academia isn't for them can succeed on a different path.

I have been thinking a lot about teaching lately (maybe now that I will not be teaching anymore) and I hope to write a series of a few blog posts about it. My first post here will be on grade inflation, specifically whether curving is an effective way to combat it.

A popular method to combat grade inflation seems to be to impose a set curve for all classes. That is, for example, the top 25% of students get As, the next 35% get Bs and the bottom 40% gets Cs, Ds, and Fs (which is the guideline for my class). While this necessarily avoids the problem of too many people getting As, it can be a bit too rigid, which I will show below.

In the class I teach, there are ~350 students, who are spread among three lectures. I will investigate what effect the splitting of students into the lectures has on their grade. First, I will make an incredibly simple model where I assume there is a "true" ranking of all the students. That is, if all the students were actually in one big class, this would be the ordering of their grades in the course. I will assume that the assessments given to the students in the classes they end up in are completely fair. That is, if their "true" ranking is the highest of anyone in the class, they will get the highest grade in the class and if their "true" ranking is the second highest of anyone in the class they will get the second highest grade and so on. I then assign students randomly to three classes and see how much their percentile in the class fluctuates based on these random choices. This is shown below

The straight black line shows the percentile a student would have gotten had the students been in one large lecture. The black curve above and below it shows the 90% variability in percentile due to random assignment.

We see that even random assignment can cause significant fluctuations, and creates variability particularly for the students in the "middle of the pack." Most students apart from those at the top and bottom could have their letter grade change by a third of a letter grade just due to how the classes were chosen.

Further, this was just assuming the assignment was random. Often, the 8 am lecture has more freshman because they register last and lectures at better times are likely to fill up. There may also be a class that advanced students would sign up for that conflicts with one of the lecture times of my course. This would cause these advanced students to prefer taking the lectures at other times. These effects would only make the story worse from what's shown above.

We have also assumed that each class has the ability to perfectly rank the students from best to lowest. Unfortunately, there is variability in how exam problems are graded and how good questions are at distinguishing students, and so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything or not. Since the grading is based on a way to rank students, even if a lecturer is ineffective and thus the students in the course don't learn very much, the student's score will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but it seems like any top university should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). In particular, an issue here is that how much of the course material students learned is an extremely hard factor to quantify without imposing standards. However, standardized testing leads to ineffective teaching methods (and teaching "to the test") and is clearly not the answer. I'm not aware of a good way to solve this problem, but I think taking data-driven approaches to study this would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages for each of the classes, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, ideally the upper and lower curves would be much closer to the grade in the "true ranking" scenario. Thus, luck or scheduling conflicts have much less of an effect on a students grade. Then the question becomes how to accomplish this. This would mean that sometimes classes would get 40% As and maybe sometimes 15% As, but it would be okay because this is the grade the students should get.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean having three different test problems on each topic and randomly assigning each student one of these three problems. Apart from the logistic nightmare this would bring about, this would really only work when one lecturer is teaching all the classes. For example, if one of the lecturers is much better than another or likes to do problems close to test problems in lecture, then the students will perform better relative to students in other lectures because of their lecturer. To make this work, there needs to be a way to "factor out" the effect of the lecturer.

Another method would be to treat grading more like high school and set rigid grade distributions. The tests would then have to be written in a way such that we'd expect the outcome of the test to follow the guideline grade distributions set by the university, assuming the students in the class follow the general student population. Notably, the test is not written so that the particular course will follow the guideline grade distribution. Of course this is more work than simply writing a test, and certainly, the outcome of a test is hard to estimate. Often I've given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing the instructor did an extremely similar problem as a test problem in class.

One way to implement this would be to look at past tests and look at similar problems, and see how students did on those problems. (Coincidentally, this wasn't possible to do until recently when we started using Gradescope). This gives an idea how we would expect students to perform, and we can use this data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they'll have seen while practicing exams and so it is hard to define how similar a problem is. To do this right requires quite a bit of past data on tests, and as I mentioned earlier this isn't available. Similar problems given by other professors may help, but then we run into the same problem above in that different lecturers will standardize differently from how they decide to teach the course.

Without experimenting with different solutions, it's impossible to figure out what the best solution is, but it seems crazy to accept that curving classes is the best way. Through some work, there could be solutions that encourage student collaboration, reward students for doing their homework (I hope to write more on this in the future) instead of penalizing them for not doing their homework, and take into account how much students are actually learning.

Code for the figure is available here.

In this post, I'll be looking at trends in exam responses of physics students. I'll be looking at the Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested if things students get right correlate with the physics expectation. For example, I might expect students who were able to apply Gauss's law correctly to be able to apply Ampere's law correctly as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique that is often applied to topic modeling for bodies of text (like the last blog post). The idea of NMF is to take a matrix with positive entries, $A,$ and find matrices $W$ and $H,$ also with positive entries, such that

Usually, $W$ and $H$ will be chosen to be low-rank matrices and the equality above will be approximate. Then, a vector in $A$ is now expressed as the positive linear combination of the small number of rows (topics) of $W.$ This is natural for topic modeling as everything is positive, meaning that cancellations between the rows of $W$ cannot occur.

The data

For each student and for each rubric item, Gradescope stores whether the grader selected that item for the student. Each rubric item has points associated with it, so I use this as the weight of the matrix to perform the NMF on. The problem, though, is that some rubric items correspond to points being taken off from the student, which is not a positive quantity. In this case, I took a lack of being penalized to be the negative of the penalty, and those that were penalized had a 0 entry in that position of the matrix.

There was also 0 point rubric items (we use these mostly as comments that apply to many students). I ignore these entries. But finding a way to incorporate this information could also be interesting.

Once the matrix is constructed, I run NMF on it to get the topic matrix $W$ and the composition matrix $H.$ I look at the entries in $W$ with the highest values, and these are the key ideas on the topic.

Results

The choice of the number of topics (the rank of $W$ and $H$ above) was not obvious. Ideally, it would be a small number (like 5) so it would be easy to just read off the main topics. However, this seems to pair together some unrelated ideas by virtue of them being difficult (presumably because the better students did well on these points). Another idea was to look at the error $\| A - WH\|_2$ and to determine where it flattened out. As apparent below, this analysis suggested that adding more topics after 20 did not help to reduce the error in the factorization.

With 20 topics, it was a pain to look through all of them to determine what each topic represented. Further, some topics were almost identical. One such example was a problem relating to finding the work in an adiabatic process. Using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics. However, one topic had to be able to compute the work correctly, as the other one did not. This is probably an indication that the algebra leading up to finding the work was difficult for some. I tried to optimize these problems and ultimately chose 11 topics, which seems to work reasonably well.

Some "topics" are topics simply by virtue of being worth many points. This would be rubric items with entries such as "completely correct" or "completely incorrect." This tends to hide the finer details that in a problem (e.g. a question testing multiple topics, which is quite common in tests we gave). These topics often had a disproportionate number of points attributed to them. Apart from this, most topics seemed to have roughly the same number of points attributed to them.

Another unexpected feature was that I got a topic that negatively correlated with one's score. This was extremely counter-intuitive as in NMF each topic can only positively contribute to score, so having a significant component in a score necessarily means having a higher score. The reason this component exists is that it captures rubric items that almost everyone gets right. A higher scoring student will get the points in these rubric items from other topics that also contain this rubric item. Most of the other topics had high contributions from rubric items that fewer than 75% of students obtained.

Many topics were contained within a problem, but related concepts across problems did cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. However, it was more common for topics to be entirely contained in a problem.

The exam I analyzed was interesting as we gave the same exam to two groups of students, but had different graders grade the exams (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, but many topics seemed to cluster rubric items slightly differently. Still, the general topics seemed to be quite consistent between the two exams.

The plots show a student's aptitude in a topic as a function of their total exam score for three different topics. Clearly, depending on the topic the behaviors can look quite different.

Looking at topics by the student's overall score has some interesting trends as shown above. As I mentioned before, there are a small number (1 or 2) topics which students with lower scores will "master," but these are just the topics that nearly all of the students get points for. A little over half the topics are ones which students who do well excel at, but where a significant fraction of lower scoring students have trouble with. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may be indicative of misconceptions that students have in approaching these problems. My guess here would be that students did not evaluate an integral to determine the entropy change, but tried to determine it in some other way.

The rest of the topics (2-4) were topics where the distribution of points was relatively unrelated to the total score on the exam. In the example shown above, the topic was calculating (and determining the right signs) of work in isothermal processes, which is a somewhat involved topic. This seems to indicate that success in this topic is unrelated to understanding the overall material. It is hard to know exactly, but my guess is that these topics test student's ability to do algebra more than their understanding of the material.

I made an attempt to assign a name to each of the topics that were found by analyzing a midterm (ignoring the topic that negatively correlated with score). The result was the following: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every item in each topic fit nicely with these topics. In particular, the rubric items that applied to many students (>90%) would often not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The following spider charts show various student's abilities in each of the topics. I assumed each topic in a concept contributed equally to the concept.

Aptitude in the four main concepts for an excellent student (left) an average student (middle) and a below average student (right).

Conclusions

Since the data is structured to be positive and negative (points can be given or taken off), there may be other matrix decompositions that deal with the data better. In principle, this same analysis could be done using not the matrix of points, but the matrix of boolean (1/0) indicators of rubric items. This would also allow us to take into account the zero point rubric items that were ignored in the analysis. I do not know how this would change the observed results.

I had to manually look through the descriptions of rubric items that applied to each topic and determine what the topic being represented was. An exciting (though challenging) prospect would be to be able to automate this process. This is tricky, though, as associations that $S$ and entropy are the same could be tricky. There may also be insights from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.

# Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year, and generally appreciate it I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks they use, but I will not get into that here). Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and was often taking the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time when the train arrived on time to Durham, and often the delays were over an hour.

This brings me to the point of this post, which is to answer the question, when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available so that past information is all I have to go off of. Certainly if all the trains are actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would make you miss about this many trains.

Fig. 1: Delay of arrival of the 79 train to the Durham station for each day since 7/4/2006 (with cancelled trains omitted). Note that in the first year and a half of this data, there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all the data available for the 79 train arriving to the Durham station. I've plotted the arrival times during this time in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" where I sample the distribution of arrival times to the Durham station. The particular train I take is just another experiment, and I would expect it to follow the available distribution of arrivals. Thus, the probability of me missing the train if I arrive $\tau$ minutes after the scheduled time is

Where $N(t>t')$ counts the number of arrivals in the available data where the arrival $t'$ is greater than the specified $\tau.$ The question, then, is how much of the data to include in $N(t>t').$ To test this, I considered a half year's worth of data as a test set. Then, I figured out how much of the previous data I should use as my training set to most accurately capture the delays in the test set. I found that using a year of data prior to the prediction time worked the best. The method is not perfect; the percentage of missed trains predicted using the training set is about 10% off from the number in the test set, as there are year-to-year variations in the data.

A plot of $p(\tau)$ using all the available data and only using the last year of data is shown in Fig. 2. Using only the last year to build the model, to have an 80% chance of making the train, one can show up about 20 minutes after the scheduled time. This also confirms Amtrak's estimate that their trains are on time 20-30% of the time. Even if one shows up an hour after the scheduled time, though, he or she still has a 36% chance of making the train!

Fig. 2: $p(\tau)$ determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, $\delta,$ given the data of past arrivals, $\mathcal{D}.$ I will call this $p(\delta|\mathcal{D}).$ Suppose I have a model, characterized by a set of $n$ unknown parameters $\vec{a}$ that describes the probability of delay. I will assume all the important information that can be extracted from $\mathcal{D}$ is contained in $\vec{a}.$ Then, I can decompose the delay distribution as

Using Bayes theorem, $p(\vec{a}|\mathcal{D})$ can then be expressed as

Here, $p(\mathcal{D}|\vec{a})$ is the likelihood function (the model evaluated at all of the data points), $\pi(\vec{a})$ is the prior on the model parameters, and $p(\mathcal{D})$ is the evidence that serves as a normalization factor.I use non-informative priors for $\pi(\vec{a}).$

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model $\mathcal{M}_1$ is $f$ times greater than the evidence for a model $\mathcal{M}_2$ where

As the models are assumed to depend on parameters $\vec{a}$ (note that a method that does not explicitly have a functional form, such as a machine learning method, could still be used if $p(\mathcal{D}|\mathcal{M})$ could be estimated another way)

Here, $\delta_i$ are all of the delays contained in $\mathcal{D}.$ This integral becomes difficult for large $n$ (even $n=3$ is getting annoying). To make it more tractable, let $l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})),$ and let $\vec{a}^*$ be the value of the fit parameters that maximize $l(\vec{a}).$ Expanding as a Taylor series gives

where $H$ is the matrix of second derivatives of $l(\vec{a})$ evaluated at $\vec{a}^*.$ The integral can be evaluated using the Laplace approximation, giving

which can now be evaluated by finding a $\vec{a}^*$ that maximizes $p(\mathcal{D}|\vec{a}).$ (Regular priors much be chosen for $\pi(\vec{a}^*|\mathcal{M})$ since I have to evaluate the prior. I will ignore this point here). I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found under this procedure that the Gamma/Gompertz function described the data the best under this metric. Using this, I explicitly calculate $p(\delta|\mathcal{D}),$ again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.

While this section got a bit technical, it confirms the results of the earlier simple analysis. In particular, this predicts that one should show up about 17 minutes after the scheduled arrival time to ensure that he or she will catch 80% of trains, and that one still has 30% chance of catching the train if one shows up an hour late to the Durham station.

Fig. 3: $p(\delta|\mathcal{D})$ calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.

Conclusions

Since I downloaded all the data, 5 days have passed and in that time, the delay of the 79 train has been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it seems like the prediction that 80% of the time one would be fine showing up 20 minutes late to the Durham station isn't such a bad guideline.

Of course, both of these models described are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainties is that all days are not equivalent. Certain days will be more likely to have a delay than other. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the model, and for a true model of delays these should either be included or the effect of such systematic fluctuations could be characterized.

Scripts for this post are here.

# Hypothesis Testing in a Jury

I recently served on a jury and was quite surprised as how unobjective some of the other jurors were being when thinking about the case. For our case, it turned out not to matter because the decision was obvious, but it got me thinking about a formal reasoning behind "beyond a reasonable doubt." This reasoning will involve more statistics than physics, but considering I've been thinking about Bayesian analyses recently in my research, it's quite appropriate.

At the most basic level, a jury decision is a hypothesis test. I wish to distinguish between the hypotheses of not guilty (call it $\mathcal{H}_0,$ since the defendant is innocent until proven guilty) and guilty (call it $\mathcal{H}_1$). In Bayesian statistics, the way to compare two hypotheses is by computing the ratio of posterior probabilities.

Where $p(\mathcal{H}|D)$ is the probability of the assumption $\mathcal{H}$ given the available data (the evidence). This probably does not seem obvious to compute, and I'll discuss later how one might determine values for these. If $F=2,$ then the hypothesis $\mathcal{H}_1$ is twice as likely as the hypothesis $\mathcal{H}_0$. Thus, if $F \gg 1,$ then the evidence for $\mathcal{H}_1$ is overwhelming. What this means in terms of "beyond a reasonable doubt" is debatable, but it is generally accepted that if $F \gtrsim 100,$ there is strong evidence for $\mathcal{H}_1$ over $\mathcal{H}_0$ [1]. Similarly, if $F\ll 1,$ then the evidence for $\mathcal{H}_0$ is overwhelming. Thus, the question or determining guilt or not guilt is equivalent to calculating $F.$

$p(\mathcal{H}|D)$ can be rewritten using Bayes theorem as

Here, $p(D|\mathcal{H})$ is the probability of the evidence given the assumption of guiltiness (or not-guiltiness), which is more tractable than $p(\mathcal{H}|D)$ itself. Note the prosecutor's fallacy can be thought of as confusing $p(D|\mathcal{H})$ with $p(\mathcal{H}|D).$ $p(\mathcal{H})$ is the prior, which takes into account how much one believes $\mathcal{H}$ with regard to other hypotheses. $p(D)$ is a normalization factor to ensure probabilities are always less than 1. In the relation for $F,$ this cancels out, so there is no need to worry about this term. With this replacement, the ratio of posterior probabilities becomes

The first ratio is called the Bayes factor. The second ratio quantifies the ratio of prior beliefs of the hypotheses. The ratio is, given no other information, the odds that the defendant is guilty. Suppose I accept that there was a crime committed, but that the identity of the criminal is in question. If there is only one person that committed the crime, this would then be the inverse of the number of people who could have committed the crime.

Now, I will consider the calculation of the Bayes factor for a real trial. Consider R v. Adams, which set a precedent for banning explicit Bayesian reasoning in British Courts in the context of DNA evidence. It was estimated during the trial that there were roughly 200,000 men in the age range 20-60 who could have committed a crime. Note that some extra assumptions on age and gender seem to be made here, so this does not seem applicable to the ratio of prior beliefs. However, if I lifted these restrictions, the Bayes factor for the victim's misidentification of the defendant would change accordingly, so this is not a concern.

First, consider the DNA evidence that was the only piece of evidence incriminating the defendant. Call this evidence $D_0.$ $p(D_0|\mathcal{H}_1)$ is the probability of a positive DNA match under the assumption that the defendant is guilty. This is presumably extremely close to 1, or DNA evidence would not be considered good evidence in trials. $p(D_0|\mathcal{H}_0)$ is the probability of a positive match if the defendant is not guilty. Taking into account the population of the U.K., this was estimated in the trial to be between 1 in 2 million and 1 in 200 million (though possibly as low as 1 in 200 since the defendant had a half-brother) [2]. Thus, the Bayes factor considering only the DNA evidence, is between 2 million and 200 million. With the 1 in 200,000 prior probability, the posterior probability ratio $F$ is between 10 and 1000. Only the higher end of this range is overwhelming evidence, and in the case of conflicting evidence, the jury is supposed to give the benefit of the doubt to the defendant, so it seems a "not guilty" verdict would have been appropriate.

Further, this ignores all of the other evidence that helped to prove the defendant's innocence. This included the victim failing to identify the defendant as the attacker and the defendant having an alibi for the night in question. Let us call these two pieces of evidence $D_1$ and $D_2.$ Unlike the DNA evidence, the witnesses do not explicitly mention what the relevant probabilities are for this evidence, so it is up to the jurors to make reasonable estimates for these quantities. $p(D_1|\mathcal{H}_1)$ is the probability the victim fails to identify the defendant as the attacker given the defendant's guilt. Set this to be around 10%, though police departments may actually have statistics on this rate. On the other hand, $p(D_1|\mathcal{H}_0),$ the probability the victim fails to identify the defendant as the attacker given the defendant is not guilty is high, say around 90%. Thus, the Bayes factor considering the victim's failed identification of the defendant is about 1 in 10. Note that even if these numbers change by 10% this factor doesn't change in order of magnitude, so as long as a reasonable estimate is made for this factor, it doesn't really matter what the actual value is. The alibi is less convincing. Though the defendant's girlfriend testified, the defendant and the girlfriend could have confirmed their story with each other. Thus, I estimate the Bayes factor for the alibi $p(D_2|\mathcal{H}_1)/p(D_2|\mathcal{H}_0)$ to be about 1 in 2.

Since all these pieces of evidence are independent, $p(D_0,D_1,D_2|\mathcal{H})=p(D_0|\mathcal{H})p(D_1|\mathcal{H})p(D_2|\mathcal{H}).$ Thus, the Bayes factor for all evidence is between 100,000 and 10,000,000. Now, multiplying by the prior probability, this gives a posterior probability ratio, $F,$ between 0.5 and 50. With the new evidence taken into account, there is no longer strong evidence that the defendant is guilty even in the best case scenario for the prosecution. Convicting someone with a posterior probability ratio of 50 would falsely convict people 2% of the time, which seems like an unacceptable rate if one is taking the notion of innocent until proven guilty seriously. Note that as long as the order of magnitude of each of the Bayes factor estimates doesn't change, the final result will also not change by more than 1 order of magnitude, so the outcome is fairly robust.

While this line of logic was presented to the jurors during the trial, the jurors still found the defendant guilty. The judge took objection to coming out with a definite number for the odds of guilt when the assumptions going into it are uncertain, though as argued before, for any reasonable choice, these numbers cannot change too much [2]. It seems that without formal training in statistics, it was difficult to accept these rules as "objective," even though this is a provably, well-defined, mathematical way to arrive at a decision. If common sense and the rules of logic and probability are really what jurors are considering to reach their decision, this has to be the outcome that they reach. [3] argues that to not believe the outcome of a Bayesian argument like this would be akin to not believing the result of a long division calculation done on a calculator.

The most common objection to Bayesian reasoning (though apparently not the one the judges in R v. Adams had) is that the choice of prior can be somewhat arbitrary. In the example above, in the estimation of people that could have committed the crime, I could take people who were on the same block, the same neighborhood, the same city, or maybe even the same state. Each of these would certainly give different answers, so care must be taken to choose appropriate values for the prior. This doesn't make the method wrong or unobjective. It is just that the method cannot start with no absolutely no assumptions. Given a basic assumption, though, it provides a systematic way to see how the basic assumption changes when all the evidence is considered.

This problem seems to stem from a misunderstanding of basic statistics by the jury, attorneys, and judges. Another example of this is the U.K. Judge Edwards-Stuart who claimed that putting a probability on an event that has already happened is "pseudo-mathematics" [4]. This just shows the judge's ignorance, as this is precisely the type of problem Bayesian inference can explain. It is a shame that in the U.K. Bayesian reasoning is actively discouraged due to R v. Adams, as this is the only rigorous way to deal with these types of problems. I wasn't able to find any specific examples in the U.S., but I assume the "fear" of Bayesian statistics in courts here is similar to the case in the U.K.

References
1. Jeffreys, H. 1998. The Theory of Probability.
2. Lynch, M. and McNally, M., 2003. “Science,” “common sense,” and DNA evidence: a legal controversy about the public understanding of science. Public Understanding of Science, 12-1, 83-103.
3. Fenton, N. and Neil, M., 2011. Avoiding Probabilistic Reasoning Fallacies in Legal Practice using Bayesian Networks. Austl. J. Leg. Phil., 36, 114-151.
4. Nulty and Others v. Milton Keynes Borough Council, 2013. EWCA Civ 15.

# Resistors and Distance on Graphs

I feel bad for not having written a post in so long! I have been busy with teaching, research, and various other projects. Now that my teaching duties are done, I will try to post more regularly!

A graph is a collection of nodes that are connected by edges that are drawn under a defined condition. These edges may or may not have weights. Graphs are useful representations of data in many scenarios. For example, the Internet is an example of a graph. Each web page would be a node and an edge would be drawn between two nodes if one of the pages links to the other. These edges could be unweighted or could represent the number of links between the two pages. Another example is social media. For example, the nodes could be all users of the service and an edge would be drawn if two nodes are "friends" with each other.

One way to define a distance between two nodes on the graph may be to find the shortest path between the nodes. This would be defined as the collection of edges from one node to the other such that the sum of the weights of the edges (or simply the number of edges in the case of an unweighted graph) is minimized. This is what LinkedIn seems to do when they compute the "degree" of connection of a stranger to you. This is also how Bacon and Erdös numbers are defined.

Fig. 1: Two graphs with the same shortest path distance between A(lice) and B(ob). However, it is clear that in the right graph A(lice) seems better connected to B(ob) than in the left graph.

A shortcoming of this measure, though, is that shortest path ignores how many paths there are from one node to the other. This scenario is depicted in Fig. 1. Suppose in a social network, we would like to determine something about how likely it is that one person (say Alice) will meet another (say Bob) in the future. Suppose Bob is a friend of only one of Alice's friends. Then, given no other information, it seems unlikely that Alice would ever meet Bob, since there is only one avenue for Alice to meet Bob. Of course, this could be different if Bob was good friends with Alice's significant other, so we might want to weight edges if we have information about how close Alice is to her friends in the social network. Now, if Bob is friends with half of Alice's friends, it seems quite likely that when Alice goes to a party with her friends or is hanging out with one of those friends, then Alice will run into Bob. In both of these cases, the shortest path distance between Alice and Bob is the same, but the outcome of how likely they are to meet (which is what we want to analyze) is different.

It turns out that a useful analogy can be made by considering each edge as a resistor in a resistor network. In the unweighted case, each edge is a resistor with resistance 1 (or $1~\Omega$ if having a resistance without units bothers you, though units will be dropped for the rest of the post) and in the weighted case, each edge is a resistor with resistance equal to the inverse weight of the edge. We will see that the effective resistance between two nodes is a good measure of distance that more accurately captures the scenario described above. Groups of nodes with small effective resistance between them will correspond to clusters of people (such as people who work in one workplace) in the social network.

The effective resistance satisfies the triangle inequality, which is a defining property of distances [1]. We can see this as follows. The effective resistance between nodes $a$ and $b$ is the voltage between the two nodes when one unit of current is sent into $a$ and extracted from $b.$ Let the voltage at $b$ be zero (we are always free to set zero potential wherever we like). Then $R_{ab} = v_a = (v_a-v_c)+v_c.$ Now, we know that $v_a-v_c \leq R_{ac},$ since the potential should be maximal at the source (current doesn't climb uphill). Similarly, $v_c\leq R_{cb}$ (remember node $b$ is grounded). Thus, we have that $R_{ab} \leq R_{ac}+R_{cb},$ and the triangle inequality holds. The other defining characteristics hold quite trivially, so the effective resistance is a reasonable way to calculate distance.

Now consider the example presented before and depicted in Fig. 1. If Bob is only friends with one of Alice's friends, and there are no other links between Alice and Bob, then the effective resistance between Alice and Bob is 2. In this case, the effective resistance is the same as the shortest path distance. If Bob is friends with 7 of Alice's friends, and there are no other links between Alice and Bob, the effective resistance between Alice and Bob is 0.29. So, we see that this satisfies the property that having one connection is not as important as having many connections to Bob.

Another interesting consequence of this analogy is related to random walks. Suppose a walker starts at a node $v$, and chooses a random edge connected to that node to walk along (the random choice is weighted by edge weights if the graph is weighted). Then, the expected number of steps to start at $v,$ go to another node $w,$ and return back to $v$ (which is called the commute time) is proportional to the effective resistance between $v$ and $w.$ One way to think about this is that a way to estimate effective resistance, then, on a large resistor network would be through the use of random walks [2]. This is interesting for tricky resistor problems such as the infinite grid of resistors. This also reinforces the idea that effective resistance is a good way of quantifying communities. When one node is well connected with another, it should be relatively easy to commute between the nodes, and thus they should be part of the same community. Further, this notion can be used to place bounds on the maximum possible commute time.

Effective resistance has many other uses as well. Quickly computing effective resistance turns out to be quite useful in speeding up an algorithm that solves the maximum flow problem [3]. In addition, sampling edges by weights proportional to the effective resistance across the edge yields an algorithm that sparsifies graphs. This means that the new graph has fewer nodes and edges than the original graph, but looks "close" to the original graph, and is a more efficient representation of that information [4].

References
1. Lau, L.C., 2015. Lecture notes on electrical networks.
2. Ellens, W., 2011. Effective Resistance.
3. Christiano, P., et al., 2010. Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs.
4. Spielman, D. and Srivastava, N., 2011. Graph Sparsification by Effective Resistances. SIAM J. Comput., 40(6), 1913–1926.

# Ferguson and Game Theory

Note: In this post I use words like "player," "game," and "payout" to refer to the interactions between criminals and police officers. I do not mean to trivialize deadly encounters between criminals and police officers and I am not implying that police officers are playing a game at their jobs. These are the terms that are used in discussions of game theory and to stay consistent with this, I use these terms.

In this blog, I usually talk about how ideas of physics apply to studying problems "outside physics". As far as I know, game theory isn't too useful in physics, but physicists have contributed to the theory of it as in [1]. Game theory is an interesting way to mathematically model interactions of people.

In a game theory game such as the prisoner's dilemma, each player has a choice of cooperating or defecting with the opponent. If the player cooperates while the other cooperates, the two players both get a mid sized reward. If one player cooperates while the other defects, that player gets nothing while the opponent wins a large reward. In the player defecting, opponent defecting case, the converse occurs. If both players defect, they both get a small reward. While seemingly simple, there is a rich theory to games like these. In particular, the iterated prisoner's dilemma (where this game is repeated many times) is used to study population dynamics.

In a game, the rewards (or payouts as they will be referred to later) are characterized by a payout matrix. The payout matrix shows, for each player, what his or her payout will be given each of the possible outcomes.

Fig. 1: Examples of payout matrices for each player in the prisoner's dilemma. If both players cooperate, they both get a payout of 3. If both players defect, they both get a payout of 1. If one player defects while the other cooperates, the defector gets 5 while the cooperator gets 0. Note that for the prisoner's dilemma, the game looks the same for each player. In the game I consider below, this is not the case.

We will apply this type of game theory logic to whether a police officer should shoot a criminal or not. This game is more complicated than the classical prisoner's dilemma as the payoff is not the same to either player (police officer or criminal), even if we assume each player values their life the same. Further, the criminal's payoff matrix is certain, while the police officer's payoff matrix has some uncertainty.

The scenario I am imagining is the following, inspired by the Ferguson incident. A criminal has committed a crime and is being chased down by a (or a pair of) police officer(s). I will take cooperating to mean that one player does not attack the other player, and defecting to mean that one player does attack the other player. I assume that if a player has a gun and defects, the other player dies, though most of the following argument would also follow if instead of dying the player is severely injured. For payouts, I assume a hierarchy where the payout for nothing happening >> the payout for arrest, suspension >> the payout for death (severe injury). I will take specific values of 0,-100,-10000 for these payouts mostly to make the arguments easier to understand, but the general results should hold regardless of the specific values chosen, as long as this hierarchy is in place. I assume that the game is played only once between the criminal and police officer(s), but it would be possible to think of this as an iterated game, where both players have to assess at each stage how much danger they are in and determine to defect or cooperate.

The payout matrix for the law enforcement officer is uncertain. The payout looks like the following, where the top consequences are the payouts when the criminal has a gun (or a lethal weapon) and the bottom are consequences when the criminal does not.

Fig 2: Payout matrix for the law enforcement officer. The top consequences are the consequences when the criminal has a lethal weapon, whereas the bottom consequences are when the criminal does not.

I would like to assign numerical values to these scenarios. In order to average the two scenarios (criminal having a lethal weapon and not) I will assume a prior distribution. This would be the fraction of criminals that are arrested by police officers that are carrying lethal weapons. This prior distribution presumably depends on the type of the crime the criminal committed, but I will not take this factor into account in the simple model I am considering. I was not able to find information about this distribution easily (though I hope this information is available to law enforcement officers) so I assume that 50% of criminals are carrying lethal weapons. However, I expect that this is an overestimate.

Fig 3: Numerical values for the payout matrix for the law enforcement officer. The values are obtained by averaging the payouts of the each of the consequences weighted by the probability that the criminal has a lethal weapon.

Thus, if the law enforcement officer knows that the criminal is cooperating, the optimal thing for him or her to do is to cooperate as well, as this is what maximizes the payout. Unfortunately, there is a chance the criminal will defect. In this case, the police officer should still cooperate, as there is a chance the criminal does not have a lethal weapon. Then, the officer has no consequences, which is better than facing suspension. Note, however, that if the criminal is defecting, the difference in payout for the two strategies of the police officer is quite small (0.1% with the numbers I have chosen). The two strategies are almost equivalent to the police officer.

Now consider the criminal. The payout matrix for the criminal is certain, as it is common knowledge that law enforcement officers carry weapons. It would look like the matrices below.

Fig 4: Descriptive and numerical payout matrices for the criminal. We take the criminal defecting, police officer cooperating case to be -1000 as it is part way between -100 and -100000. This may be an underestimate as if a criminal manages to shoot a police officer, there is a chance that another police officer will shoot the criminal.

The interesting thing here is that if the police officer is defecting (which, recall is not such a bad strategy for the police officer if the police officer assumes the criminal will defect), there is no optimal strategy for the criminal. The criminal dies in both cases. Clearly then, as there is no strategy if the police officer defects, the criminal can only optimize the case when the police officer cooperates, meaning the criminal should cooperate. Thus, the Nash equilibrium is for the criminal and police officer to cooperate. This is presumably what happens most of the time. Criminals are arrested every day but we (comparatively) rarely hear about incidents where the criminal or police officer is killed. So does this mean game theory cannot explain what happened at Ferguson? Not quite.

Let's say the strategy of the police officer is to defect a fraction $x$ of the time and to cooperate $1-x$ of the time. Then the expected payout (probability multiplied by the payout of that case) for the criminal is $-100(1-x)-100000x$ if the criminal cooperates and $-1000(1-x)-100000x$ if the criminal defects. As an example, take $x$ to be $0.01$. At this value, the expected payout for the criminal is $-1099$ if he or she cooperates and $-1990$ if he or she defects. Clearly, the defecting strategy is still worse, but the difference between cooperating and defecting is only about 50%. The difference would be even less if the fraction $x$ were higher or if the payout of death were even more negative. Thus, if there's any reason for the criminal to think that the police officer may defect (and given news reports of incidents like Ferguson, there is probably reason for the criminal to expect this), the difference between cooperating and defecting is not so different in terms of payout. This means the actions of the criminal become unpredictable, which means the police officer may need to defect to preserve his or her life.

The main cause of this issue is that dying is really bad. Thus, if there's even a small chance that the police officer will defect, the criminal's choices become mostly equivalent. Now suppose police officers are only able to use non-lethal means against the criminal. Then the payout matrix for the criminal looks quite different.

Fig 5: Descriptive and numerical payout matrices for the criminal when the police officer cannot use lethal means against him or her. I choose a value of -200 for getting hurt, though it may not necessarily be this bad. If the criminal can prove that the police officer attacked him or her without just cause, the criminal may even receive some sympathy.

Now, if the criminal defects, the payout is -1000 no matter what, while if the criminal cooperates the expected payout can be anywhere between -100 and -200, depending on the police officer's strategy. The expected payout is no longer (approximately) degenerate as above. The clear choice here is to cooperate, as defecting is many times worse. If the criminals and police officers are acting rationally, then an assurance by law enforcement to not kill criminals seems to improve the rate that the criminals should cooperate. Of course, the problem with this is, that while the criminal's payoff matrix has changed by ensuring the police officer will not kill the criminal, the police officer's payoff matrix remains the same.

As mentioned earlier, unless the probability the criminal has a lethal weapon is extremely small, if the criminal has a chance to defect, the expected payout for the police officer is also nearly degenerate no matter the police officer's strategy. The police officer's situation could be improved though, if the chance of death of the police officer were smaller. This could include protective equipment and training to not get fatally wounded in these encounters. In addition, there is a crucial difference between the criminal and police officer in that the police officer has more experience with these encounters than the criminal does (though the police officer is disadvantaged in that he or she has less information about the opponent). Thus, it is imperative that the police officer is trained to make the right decisions in these tough situations. As stated before, this is presumably what happens most of the time when criminals get arrested, and thus the police officers are usually doing a good job. If the payoff to the police officer were truly degenerate, we would expect the police officers to defect about half of the time. However, there are incidents where the police officer unnecessarily perceives danger. The police officers should be aware aware of their risks so that tragedies do not occur.

The payoff matrix for the police officer also looks quite different based on the prior distribution (or expectation) of the criminal's likelihood to be carrying a lethal weapon. I assume I have overestimated the likelihood a criminal carries a lethal weapon, but with this information the police officer's "strategy" could be improved, and as I stated above, this could prevent people from unnecessarily dying.

Most of the analysis above has depended on the players of the game acting at least approximately rationally. It is an interesting question, though, how rationally we can expect the players in this game are playing. The criminal presumably is "on edge" from just having committed a crime and being chased down by police officers would probably not help the situation. In a study done with the game show Golden Balls, contestants were found to be more altruistic, in general, than a simple game theory analysis of rational players would predict [2]. This shows that even in scenarios where life or death is not on the line, people may not be expected to act rationally. With information about how criminals are likely to act, police officers can develop better strategies on how to deal with situations such as this one.

References
1. Press, W.H., 2012. Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl. Acad. Sci. 109-26, 10409–10413.
2. Van den Assem, M.J., 2012. Split or Steal? Cooperative Behavior When the Stakes Are Large. Management Science. 58-1, 2-20.

# Pedestrian dynamics

I've written about modeling the movement of cars as a fluid in the past. We could think about pedestrians like this, but usually pedestrians aren't described well by a fluid model. This is because while cars are mostly constrained to move in one direction (in lanes), this is not true of pedestrians. On a sidewalk, people can be walking in opposite directions, and often someone walking in one direction will directly interact (by getting close to) someone walking in the opposite direction. There are some specific scenarios where a fluid model could work, such as a crowd leaving a stadium after the conclusion of a basketball game. In this case, everyone is trying to get away from the stadium so there is some kind of flow. However, this doesn't work generally, so I will consider a different type of model, similar to the one described in [1].

If there are only two directions that people want to travel and they happen to be opposite, then we could model the pedestrians as charged particles. The pedestrians that want to go in opposite directions would be oppositely charged, and the force that keeps the pedestrians on a trajectory could look like an electric field. However, this would mean that people moving in opposite directions would attract each other, which really does not match expectations. This model also fails if there are multiple directions where pedestrians want to go (such as at an intersection), or if the desired directions of the pedestrians are not opposite. While a plasma (which is a collection of charged particles) model may not be the best to describe the scenario, I will borrow some ideas from plasma dynamics in building my model and I will use techniques used to simulate plasmas to simulate pedestrian movement.

There will be a few effects governing the movement of pedestrians. One effect is for the pedestrians to want to go a desired direction at a desired speed. It turns out that most humans walk at a speed of around $v_d=$1.4 m/s (3.1 mph) and if someone is going slower or faster than this, they will tend to go toward this speed. Let me call the desired speed along the desired direction of the pedestrian $\vec{v}_d,$ and the current walking speed of the pedestrian $\vec{v}.$ I will model the approach to the desired direction as a restorative force that looks like

Here, $\tau$ represents how long it takes the pedestrian to get back to their desired position and direction once they are off track. In general, $\tau$ could be different for every pedestrian, but for simplicity I set it as a constant for all pedestrians here, and I will take it to 0.3 s, which is close to the human reaction time. Note that the restorative force is zero when $\vec{v_d}=\vec{v},$ so if a pedestrian is already going in their desired direction at their desired speed, there will be no restorative force and the pedestrian will continue to go at this direction and speed. You may find it odd that my force has units of acceleration. I am thinking about this more as a generalized sense of the term force as in something that causes velocity changes, but it would also be reasonable to assume that I have set the mass of the pedestrians to 1.

Pedestrians will also avoid colliding with each other, which is the other force I include in the model. While [1] assumes an exponential force for the interaction force, I will assume that pedestrians interact via a generalized Coulomb potential. The general results seem to match without too much regard for the exact shape of the force. I define the force between pedestrian $i$ and pedestrian $j$ is

Where $\vec{r}_{ij} = \vec{r}_i - \vec{r}_j$. $\gamma,$ $\epsilon,$ $\alpha,$ and $r_0$ are constants that I will describe below. $r_0$ is an interaction radius that sets the scale for this interaction. This would not necessarily be the same for everyone. For example if someone is texting, their interaction radius $r_0$ is probably much smaller than someone who is paying attention to where they are going. However, for simplicity I take it to be the same for everyone, and I take it to have a value of 1.2 m.

Since pedestrians travel in 2 dimensions, if $\alpha = 1$ and $\epsilon = 0,$ this would be the Coulomb potential, if $\gamma$ were aptly chosen. In this scenario, however, I do not really want the Coulomb potential. The Coulomb potential is quite long range, meaning that particles in a Coulomb potential can influence particles that are quite far away. As the power $\alpha$ in the equation above gets larger, the force becomes more short-range, which seems to better model the interactions of pedestrians. However, this presents another problem in that the force gets extremely large if two pedestrians happen to get really close to one another. To combat this, $\epsilon$ is a small number that "softens" the force such that the force never gets extremely large (which I took to mean $|\vec{F}_{ij}|$ should never be too much bigger than the maximum possible value of $|\vec{F}_{restore}|$). $\gamma$ then decides the relative importance of this interaction force to the restorative force.

I will simulate this model by considering $N$ people are in a long hallway with aspect ratio 1:10, for example at an airport or a train station. This can also be a model for a long, wide sidewalk as even though there are no walls, people are relatively constrained to stay on the sidewalk. I have some people trying to get to one end of the hallway (in the $+\hat{x}$ direction) and some trying to get to the other end (in the $-\hat{x}$ direction). This is an example of an N-body simulation, which is widely used in studying gravitational systems and plasma systems.

In [1], the walls exerted an exponential force on the pedestrians. I choose a similar model. I set the parameters of the exponential empirically such that the pedestrians keep a reasonable distance from the walls. I set the range of the exponential force to be a tenth of the total width of the corridor. I set the magnitude such that at the maximum, the force due to the wall is the same as the maximum value of $|\vec{F}_{restore}|.$

When a pedestrian reaches the end of the corridor, I record how much time it took for that pedestrian to traverse the corridor. I then regenerate the pedestrian at the other end of the corridor as a new pedestrian. I generate the new pedestrian with a random $y$ coordinate and a random velocity direction, but pointing at least a small bit in the desired direction. The magnitude of the velocity is taken to be $v_d.$ Thus, the simulation is set up such that there will always be $N$ people in the hallway.

A simulation of $N$=100 people in a hallway of dimensions 100 m x 10 m. All pedestrians desire to go to the left of the hallway. The pedestrians relax to a state where they are each about the same distance from each other. It seems that people usually stand closer together on average, so our value of $r_0$ should probably be smaller to match observations.

The first thing I tried was to simply put a few people in the hallway all wanting to go in the same direction, and see what they do. I set the length of the hallway to be 100 m, which made the width of the hallway 10 m. As can be seen above, this isn't too exciting. The pedestrians' paths are mostly unobstructed and they get across the hallway in about 71 s, which is the length of the hallway divided by 1.4 m/s, the comfortable walking speed of the pedestrians. Even in this simple case, though, it is apparent that the pedestrians "relax" into a scenario where the average distance between the pedestrians is roughly the same.

A simulation of $N$=100 people in a hallway of dimensions 50 m x 5 m. Pedestrians are equally likely to want to go left or right. We can see that lanes of people that would like to go in the same direction can form, as was observed in [2]. This effect could be even stronger with an extra "incentive force" for people to be on the right side of the road if they are not already on that side.

The time required to cross the room as a function of density of people in a simulation of $N$=100 people. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). $p=0.5,$ which means half of the pedestrians desire to go to the left and the other half desire to go to right. I change the density of people by changing the size of the room from 2.5 m x 25 m to 10m x 100 m. As the density is higher, the pedestrians interact more with each other and thus are less likely to be on their desired trajectory.

Next, I looked at the more interesting cases of what happens when there are pedestrians that want to go in different directions. First, I assume that exactly half of the pedestrians would like to go in one direction and half would like to go the other direction. I then varied the length and width of the hallway, keeping the aspect ratio constant, while keeping the number of people in the hallway constant at 100. This has the effect of changing the density of people in the hallway. The y-axis on the graph above is normalized by $L/v_d,$ which is the time a pedestrian with all of his or her velocity in the desired direction would take. This shows that as the density increases, it takes longer (proportionally) for the pedestrians to get across the room. This makes sense as the pedestrians are interacting more often and thus cannot keep going in the desired direction.

A simulation of $N$=100 people in a hallway of dimensions 50 m x 5 m. 90% of pedestrians want to go left, while the other 10% want to go right. The right-going pedestrians undergo many interactions with the left-going pedestrians. In fact, if the left-going pedestrians were denser, this could look like Brownian motion.

The time required to cross the room as a function of $p$, the fraction of $N$=100 people that would like to go left or right. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). The size of the room is 5 m x 50 m. The blue line is the time to get across for the pedestrians going leftward, and the red line in the time to get across for the pedestrians going rightward. As the fraction of pedestrians going leftward increases, it becomes easier for those pedestrians to get across, but it makes it harder for the pedestrians that would like to go in the opposite direction to get across more slowly.

I then took the number of people in the hallway to be 100 with the length of the hallway being 50 m and the width being 5 m. I observed what happened as I varied the fraction of pedestrians, $p$ that wanted to go in either direction. This effect is shown above. As $p$ is increased, the more dominant pedestrians can get through the corridor more quickly than the less dominant pedestrians. Again, this makes sense as when people go "against the gradient," they have to weave through people to try to get to the other side.

I will note that I have not done this simulation in the most efficient way. For every pedestrian, I calculate the interaction force with all the other pedestrians and add up all the contributions. It turns out one can average or sometimes even ignore the effect of pedestrians far away, which can make the code run about $1/N$ times faster.

The python and gnuplot scripts I used for the simulation and to create the plots are available here.

References:
1. Kwak, J., 2014. Modeling Pedestrian Switching Behavior for Attractions. Transportation Research Procedia. 2. 612-617.
2. Tao, X., 2011. A Macroscopic Approach to the Lane Formation Phenomenon in Pedestrian Counterflow. Chinese Phys. Lett. 28.