Thoughts on Teaching

While I was getting my Ph.D., I taught almost every semester. I was a head TA for all but one semester which meant that I managed a team of TAs and worked closely with the course instructors, but didn't do any classroom teaching. I taught introductory electricity and magnetism the entire time, and as I saw various professors and various students go through the material, I noticed that our course really wasn't optimized to maximize how well students learn the course material. This was surprising to me as I was at an institution that is consistently rated highly by many university rating agencies.

I'll borrow a concept from Wait But Why of how remarkable advancements are. For example, my parents used computers the size of a room that has the computational capacity of a computer that now can be worn around the wrist. However, the way my parents were taught isn't too different from modern teaching. For the most part, lectures are still a lecturer standing at the front of the room writing on a blackboard. The lecturer may use a powerpoint presentation, and the students may be distracted from their laptops and phones, but the fundamental way of teaching hasn't really changed.

There are, of course, efforts to modernize learning through websites like Coursera and Khan Academy. I think the real benefit of these is real-time feedback to the instructor (in this case, the algorithms behind the course). In the class I taught, the instructor only got individual feedback from the questions students asked (which not everyone does), as they never saw student's homework or test scores except in aggregate. However, taking courses through these online services is not recognized as sufficient mastery when compared to a university course, and I doubt that this will change anytime soon. So, I thought it would be interesting to consider what keeps universities from making efforts to make their teaching more effective.

Universities are not incentivized to teach well
When I started teaching, I thought that the metric I should optimize for was student's understanding of the course material. The problem is that this is not easy to measure. Most grading at the university level seems to be done on a curve (which as I've written before, I have concerns about) to sidestep this issue. Even grading on a curve reasonably is not trivial as writing good exams is an art. I think only one instructor I worked with had truly mastered it. Poorly written, or even poorly administered, exams provide a weak signal of student understanding. As instructors write different exams, it's another challenge to combine signals from separate courses taught by different instructors to get an aggregate view of how students learn.

This is an idea that is talked about in Cathy O'Neil's Weapons of Math Destruction. Because educational quality is something that is ill-defined, rating agencies, like US News, use proxies that seem to correlate with educational quality to rank universities. This sets up a perverse incentive structure, as the university is offering the service of teaching (though some may argue the service they provide is really the degree) but they optimize around the metrics the rating agency chooses instead of actually making the learning experience great. Companies decide to hire from the perceived best schools, which means that if a high school student wants to set up for a great career, it's in their best interest to apply for these top universities even if it means they will not get the best education.

Looking at the US News' metric, it seems that 20-30% of the metric that goes into ranking colleges actually directly comes from how teaching is conducted, considering class size and funds allocated to teaching students. An almost equal proportion (22.5%) of the score is reputation, which would be hard to assess from an outsider who has not taken courses at the university. Part of this reputation could be the university's willingness to try techniques such as more active learning in classes, but this is handled in an ad-hoc way, and I would like to see it be a more direct part of the metric.

Professors are not rewarded for teaching well
I will preface this section by saying that I worked with some amazing instructors while being a head TA. I don't want to belittle the great work that they do. The fact is, though, that there are instructors who are not so good who are still teaching students and that is the topic of the rest of this post.

I found that as I worked to optimize my students' understanding of course material, that not all of the professors shared the same goal. I found that some professors (but definitely not all) were optimizing for student happiness, under the constraints of having to rank students by performance. This makes sense from the instructor's point of view as the metrics the department sees of the instructor is the grade distribution they assign and the feedback they receive from students (and maybe a peer review by other faculty members). I've always found it shocking that the department did not care what the course TAs thought of the professors, though admittedly this could be a noisy signal as some classes only have one TA.

I felt that some of the instructors who I felt were ineffective at getting their students to understand the material got the highest ratings by students. In fact, it's been shown that student surveys are another metric that is a poor proxy for teaching effectiveness.

In fact, tenure makes it hard to get rid of a professor, and I found that often the tenured professors were some of the worst instructors at optimizing student understanding. While being a Nobel prize eligible physicist is an excellent qualification for getting tenure, this really isn't a necessary or even sufficient condition to be a great lecturer¹. On the flip side, some lecturers were some of the best instructors I worked with, but they are hired on limited term contracts and are paid little compared to their tenured colleagues.

Professors are averse to change
As a head TA, I worked with an instructor who was technologically illiterate and could not do much more on his computer than respond to email. There was another professor whose methods I had pedagogical concerns about. When I raised these to him, he retorted with something along the lines of "no, we're not changing it because this is the way I've done it for 20 years and the students like it this way." I worked with others who are oversubscribed with research and family obligations that they can't be bothered to learn a new tool for the course they are teaching. This isn't all the instructors I worked with, but the instructors described here are still teaching and will continue to teach unless broad changes are made.

Where I do see some willingness to adopt new tools is when work is being automated on the professor's side. In our course, we used Mastering Physics to automate the grading of homework assignments and reduce the grading burden on TAs. One concern I have with Mastering Physics is that it locks universities into multi-year contracts, creating friction to move to a new system. Combined with the instructor's desires not to learn a new system, this causes Mastering Physics to stick around, and there is not much pressure for Mastering Physics to improve their services. I've found that this has led to the course continuing to use a suboptimal system without much hope for improvement.

I've noticed some quick wins for Mastering Physics that were not implemented in the time I used the service. For example, I noticed that 75% of the wrong answers my students submitted had incorrect units. It would be relatively easy for Mastering Physics to implement a check for this and give the students more useful feedback so that they learn their mistakes more easily.

Apparently, when Mastering Physics was first introduced into introductory physics courses, the professors found, using some standardized physics understanding tests, that understanding of physics actually increased after using Mastering Physics. The professors attributed this to the fact that the students got instant feedback on whether their answers were correct. This sounds great for Mastering Physics, but the problem is that the solution to the problems are now easily accessible and so it is easy to get credit for the homework problems without actually solving them.

I have no clue what fraction of students actually did the Mastering Physics problems. Sometimes we put problems very similar (or almost identical) to homework problems on the exams, and I'm quite surprised that many students will struggle to do the problems on the exam. Even creating new problems is not a solution, as there are services like Chegg that have others work out problems for the student. In theory, this was possible pre-internet, but the scale of the internet makes it that much easier for students to exploit.

Even beyond that, Mastering Physics and other products like WebAssign are limited in scope. They're rarely used outside of introductory classes because they can really only check the final answer. I do think there is more that can be automated in grading assignments. A lot of hard science and math questions boil down to choosing the correct starting point and showing a logical progression of steps to the final result. This is entirely within the realm of tasks a computer can do. Even for other disciplines where answers may be short answers, advancements in NLP are probably at the point where a solution could be compared with sample response(s) with quite a high accuracy. We would still need a human to check to make sure the model wasn't performing crazily, but this would considerably speed up the grading process.

The Future
Getting rating agencies, universities, and professors to care about teaching effectiveness is not an easy task. To me, the fundamental issue is that teaching effectiveness cannot be measured easily and poor proxies are being used to measure it. If rating agencies could rate universities based on teaching effectiveness, this would put pressure on universities to care more, which would then put pressure on instructors to care more.

I've mentioned before about how analyzing Gradescope data could be useful in helping instructors get a better idea of what each individual student has actually learned. As mentioned earlier, this is quite valuable as the lecturers for the course I taught never graded a homework assignment or a test. I could foresee in the future instructors being given a dashboard where they can look at what individual students understand while also being able to segment by factors such as attendance rate to get near real-time feedback about how they are performing.

But I also think Gradescope may be able to tackle the teaching effectiveness problem. By offering a service that allows for faster, less-biased grading, instructors are likely to adopt the service. I've seen that instructors respond well when tedious parts of their work are automated. As adoption increases, Gradescope gets a unique set of data, with different courses across many different disciplines across many different universities. Making sense of all the data is a challenge, but if I had to guess, I would say that this data has insights about teaching effectiveness that can hopefully drive some of the necessary changes to education.


1. This may not be true in every discipline. I have taken introductory anthropology classes in which the instructor's research was a relevant topic of discussion.

Thoughts on Grad School

For some background, there is no way I would have known in college that I would not have wanted to apply for postdocs after my Ph.D. My career plan then was to go through grad school, be successful, and someday end up a professor. During my Masters' degree, I realized that this may not be the life I wanted. My advisers at the time (who were married to each other) would routinely be at the institute until 8 or 9 in the evening and would come in early in the morning. I often wondered if they discussed much else than physics at home. During my Ph.D., I also observed my advisers working long hours and working on weekends and holidays. Once my advisor even told me that anyone I (a physicist) dated should expect for me to be unavailable on weekends and holidays when there was important physics to do. Still, I was in too deep at this point, so I determined the minimal amount of work I needed to get a Ph.D. and completed that. Now that I have a job as a data scientist, the benefits of my Ph.D. seem to be the people I met and the connections I made (which did help me get the job), but the actual knowledge I gained during the degree has been mostly useless. Looking back, I'm reminded of some of the critical issues with UC Berkeley and academia and have outlined them here.

Academia takes advantage of people
For some reason, during grad school, you are expected to volunteer your time, with no pay or credit. This is especially apparent during the summer when my contract said I was supposed to work 19 hours/week, but my adviser expected me to come in 40+ hours/week. What also shocked me is when I told fellow grad students about this issue, they were not even aware they were only supposed to be working halftime. Further, if an adviser does not have funding and the student teaches during summer to cover costs, the student has no obligation to do research during that summer, but this isn't communicated to the student. These facts are rarely spelled out. The sad thing is, this doesn't end with grad school. I know post-docs (at my institution and others) who have told me that their contracts say they should work around 40 hours/week, but are routinely actually expected to work 50-60 hours/week.

My last semester in grad school I was not enrolled in classes, not getting paid for research and was just working on my thesis. Hence, there was no obligation for me to do anything that was not for my benefit. Still, my advisers tried to guilt me into coming into the office often (with threats of not signing my thesis) and to continue doing work. It still bothers me I paid the university (for full disclosure, I made plenty of money during a summer internship to cover these costs) for the "opportunity" to work for the research group for that semester.

When research is done for course credit, the lines are blurred a bit. The time spent on the "course" is not necessarily fixed. I would think that if it is a "course," then research obligations related to the course should start when the semester starts and end when the semester ends. Certainly, the "course" should not require a student to attend a meeting on a university holiday or weekend (which happened to me during grad school). Most universities have standards and expect professors teaching courses to be available for their students. Some graduate students I know talk to their advisors a couple of times a semester which I would think hardly respects these standards. This has also led me to question what can and cannot be asked of a student in a course. For example, if there were a "course" in T-shirt making that made students work in sweat-shop like conditions to get a grade in a class, would this be legal? While not a tangible product, research for credit is a somewhat similar scenario where the students are producing papers that will ultimately benefit the professor's fame (and a small chance of advancing the student's career).

Another issue is that, as budgets get tighter, expenses get passed off to students. During my time at Berkeley, keeping pens and paper in a storeroom was deemed too expensive. The department suggested each research group make their own purchases of these items through the purchasing website. Not only is this a waste of graduate student's time, but the website was so terrible that often it was easier just to buy items and not get reimbursed for them. Most research is done on student's personal laptops, and while a necessity to continue work, rarely is their support from the university to make this purchase or pay for maintenance when it is used for research work. There is no IT staff, so again it falls on students and post-docs to waste their time dealing with network issues and computer outages rather than focusing on the work that is actually interesting.

Academia tries to ignore that most of its graduate students will not go into academia
I apparently have a roughly 50% chance of "making it" as a professor, which is mostly because I attended a good institution and published in a journal with a high impact factor. Ph.D. exit surveys have found similar rates for the fraction of students that stay in academia. Yet, the general expectation in academia is that all of the students will go on to have a research-focused career (I know some professors who look down upon a teaching-focused career as well even though this is still technically academia).

Even after making it extremely clear to my adviser that I had no intentions of pursuing a postdoc, he told me that I should think about applying. He went so far as to say data science (my chosen profession) was a fad and would probably die out in a few years. Once during his class, he seemed quite proud of the fact that between industry and academic jobs, most of his students had wound up staying in physics. While my adviser wasn't too unhappy with me taking courses unrelated to research, many advisers will actively encourage their students to focus on research and not take classes. This is terrible advice, considering useful skills in computer science and math are often crucial to getting jobs outside of academia.

The physics curriculum, in general, is flawed. At no point in a typical undergraduate/graduate curriculum are there courses on asymptotic analysis, algorithms, numerical methods, or rigorous statistics, which are all useful both inside and outside of physics. Often these are assumed known or trivial, yet this gives physicists a poor foundation and can lead to problems when working on relevant problems. I took classes on all of these topics, though they were optional, and they are proving to be more useful to me than most of the physics courses I took or even research I conducted as a graduate student.

This is a problem at the institutional level as well. For physics graduate students at UC Berkeley, the qualifying exam is set up to test students on topics of the student's choosing. I chose numerical methods and statistics as my main topic, but my committee asked me no questions on numerical methods and statistics. To be fair, one member tried, but he didn't know what to ask, but then again, he could have prepared something to ask since the topics are announced months before the exam. Instead, my committee asked me general plasma physics and quantum mechanics questions which were not topics I had chosen. I failed to answer those questions (and have even less ability to solve it now), but somehow still passed the exam. This convinced me that the exam was nothing more than a formality, yet no one was willing to make changes to make the exam more useful. An easy fix would be to frame the oral exam as interview practice, as most graduate students have no experience with interviews.

There are many student-led efforts to try to make transitioning into a non-academic job easier at UC Berkeley. But the issue is, for the most part, they are student-led and have little faculty support. I was very involved in these groups, and it was quite clear that my adviser did not want me to be involved. Considering that involvement is why I have a job right now, I would say I made the right choice.

Your adviser has a lot of power over you
As I alluded to before, even though I was receiving no money or course credit from my research group in my last semester, my advisers threatened to not sign my thesis unless I completed various research tasks (some unrelated to the actual thesis). I've talked to others that have had similar experiences, and I hate to say I'm confident that this extends beyond research tasks in some research groups. This could be solved by making the thesis review process anonymous and have the advisers take no part in it, but there seem to be no efforts to make this happen.

Another fault is that advisers can get rid of students on a whim. I've known people who have sunk three years into a research group only to be told that they cannot continue. Then, the student has to make a decision to start from zero and spend a ridiculous amount of their life in grad school or leave without a degree rendering the three years in the research group irrelevant. Because of this power, professors can make their students work long hours and come in on weekends and holidays. If the student does not oblige, the time the student has already put into the research group is just wasted time.

It doesn't help that research advisers are usually also respected members of their research area. That is, if a student decides to stay in academia (particularly in the same research area as grad school), the word of their adviser could make or break their career. This again gives an opportunity for advisers to request favors from students.

Further, the university has every motive to protect professors, especially those with tenure, but not graduate students. This became embarrassingly clear, for example, in how the university handles Geoff Marcy before Buzzfeed got a hold of the news. If professors can ignore rules set by the University (and sometimes laws) with little repercussion, there is little faith graduate students can have that new rules the university creates to any of the problems mentioned here will be followed. I don't want to belittle the (mostly student-led) efforts to make sexual harassment less of an issue at UC Berkeley, but ultimately the only change I can pinpoint is that there is now more sexual harassment training, which has been shown by research at UC Berkeley to lead to more sexual harassment incidents.

Ultimately, graduate school and a career in academia work for some people. Some people are passionate about science and love working on their problems, even if that means making a few sacrifices. Progression of science is a noble, necessary goal, and I am glad there are people out there to make it happen. My hope is that many of the problems mentioned here can be rectified so that the experience for those people and also the people who realize that academia isn't for them can succeed on a different path.

Grade Inflation

I have been thinking a lot about teaching lately (maybe now that I will not be teaching anymore) and I hope to write a series of a few blog posts about it. My first post here will be on grade inflation, specifically whether curving is an effective way to combat it.

A popular method to combat grade inflation seems to be to impose a set curve for all classes. That is, for example, the top 25% of students get As, the next 35% get Bs and the bottom 40% gets Cs, Ds, and Fs (which is the guideline for my class). While this necessarily avoids the problem of too many people getting As, it can be a bit too rigid, which I will show below.

In the class I teach, there are ~350 students, who are spread among three lectures. I will investigate what effect the splitting of students into the lectures has on their grade. First, I will make an incredibly simple model where I assume there is a "true" ranking of all the students. That is, if all the students were actually in one big class, this would be the ordering of their grades in the course. I will assume that the assessments given to the students in the classes they end up in are completely fair. That is, if their "true" ranking is the highest of anyone in the class, they will get the highest grade in the class and if their "true" ranking is the second highest of anyone in the class they will get the second highest grade and so on. I then assign students randomly to three classes and see how much their percentile in the class fluctuates based on these random choices. This is shown below


The straight black line shows the percentile a student would have gotten had the students been in one large lecture. The black curve above and below it shows the 90% variability in percentile due to random assignment.

We see that even random assignment can cause significant fluctuations, and creates variability particularly for the students in the "middle of the pack." Most students apart from those at the top and bottom could have their letter grade change by a third of a letter grade just due to how the classes were chosen.

Further, this was just assuming the assignment was random. Often, the 8 am lecture has more freshman because they register last and lectures at better times are likely to fill up. There may also be a class that advanced students would sign up for that conflicts with one of the lecture times of my course. This would cause these advanced students to prefer taking the lectures at other times. These effects would only make the story worse from what's shown above.

We have also assumed that each class has the ability to perfectly rank the students from best to lowest. Unfortunately, there is variability in how exam problems are graded and how good questions are at distinguishing students, and so the ranking is not consistent between different lectures. This would tend to randomize positions as well.

Another issue I take with this method of combating grade inflation is that it completely ignores whether students learned anything or not. Since the grading is based on a way to rank students, even if a lecturer is ineffective and thus the students in the course don't learn very much, the student's score will be relatively unchanged. Now, it certainly seems unfair for a student to get a bad grade because their lecturer is poor, but it seems like any top university should not rehire anyone who teaches so poorly that their students learn very little (though I know this is wishful thinking). In particular, an issue here is that how much of the course material students learned is an extremely hard factor to quantify without imposing standards. However, standardized testing leads to ineffective teaching methods (and teaching "to the test") and is clearly not the answer. I'm not aware of a good way to solve this problem, but I think taking data-driven approaches to study this would be extremely useful for education.

In my mind, instead of imposing fixed grade percentages for each of the classes, the grade percentages should be imposed on the course as a whole. That is, in the diagram above, ideally the upper and lower curves would be much closer to the grade in the "true ranking" scenario. Thus, luck or scheduling conflicts have much less of an effect on a students grade. Then the question becomes how to accomplish this. This would mean that sometimes classes would get 40% As and maybe sometimes 15% As, but it would be okay because this is the grade the students should get.

My training in machine learning suggests that bagging would be a great way to reduce the variance. This would mean having three different test problems on each topic and randomly assigning each student one of these three problems. Apart from the logistic nightmare this would bring about, this would really only work when one lecturer is teaching all the classes. For example, if one of the lecturers is much better than another or likes to do problems close to test problems in lecture, then the students will perform better relative to students in other lectures because of their lecturer. To make this work, there needs to be a way to "factor out" the effect of the lecturer.

Another method would be to treat grading more like high school and set rigid grade distributions. The tests would then have to be written in a way such that we'd expect the outcome of the test to follow the guideline grade distributions set by the university, assuming the students in the class follow the general student population. Notably, the test is not written so that the particular course will follow the guideline grade distribution. Of course this is more work than simply writing a test, and certainly, the outcome of a test is hard to estimate. Often I've given tests and been surprised at the outcome, though this is usually due to incomplete information, such as not knowing the instructor did an extremely similar problem as a test problem in class.

One way to implement this would be to look at past tests and look at similar problems, and see how students did on those problems. (Coincidentally, this wasn't possible to do until recently when we started using Gradescope). This gives an idea how we would expect students to perform, and we can use this data to weight the problem appropriately. Of course, we (usually) don't want to give students problems they'll have seen while practicing exams and so it is hard to define how similar a problem is. To do this right requires quite a bit of past data on tests, and as I mentioned earlier this isn't available. Similar problems given by other professors may help, but then we run into the same problem above in that different lecturers will standardize differently from how they decide to teach the course.

Without experimenting with different solutions, it's impossible to figure out what the best solution is, but it seems crazy to accept that curving classes is the best way. Through some work, there could be solutions that encourage student collaboration, reward students for doing their homework (I hope to write more on this in the future) instead of penalizing them for not doing their homework, and take into account how much students are actually learning.

Code for the figure is available here.

Topic Modeling and Gradescope

In this post, I'll be looking at trends in exam responses of physics students. I'll be looking at the Gradescope data from a midterm that my students took in a thermodynamics and electromagnetism course. In particular, I was interested if things students get right correlate with the physics expectation. For example, I might expect students who were able to apply Gauss's law correctly to be able to apply Ampere's law correctly as the two are quite similar.

I'll be using nonnegative matrix factorization (NMF) for the topic modeling. This is a technique that is often applied to topic modeling for bodies of text (like the last blog post). The idea of NMF is to take a matrix with positive entries, A, and find matrices W and H, also with positive entries, such that

 A = WH.

Usually, W and H will be chosen to be low-rank matrices and the equality above will be approximate. Then, a vector in A is now expressed as the positive linear combination of the small number of rows (topics) of W. This is natural for topic modeling as everything is positive, meaning that cancellations between the rows of W cannot occur.

The data

For each student and for each rubric item, Gradescope stores whether the grader selected that item for the student. Each rubric item has points associated with it, so I use this as the weight of the matrix to perform the NMF on. The problem, though, is that some rubric items correspond to points being taken off from the student, which is not a positive quantity. In this case, I took a lack of being penalized to be the negative of the penalty, and those that were penalized had a 0 entry in that position of the matrix.

There was also 0 point rubric items (we use these mostly as comments that apply to many students). I ignore these entries. But finding a way to incorporate this information could also be interesting.

Once the matrix is constructed, I run NMF on it to get the topic matrix W and the composition matrix H. I look at the entries in W with the highest values, and these are the key ideas on the topic.


The choice of the number of topics (the rank of W and H above) was not obvious. Ideally, it would be a small number (like 5) so it would be easy to just read off the main topics. However, this seems to pair together some unrelated ideas by virtue of them being difficult (presumably because the better students did well on these points). Another idea was to look at the error \| A - WH\|_2 and to determine where it flattened out. As apparent below, this analysis suggested that adding more topics after 20 did not help to reduce the error in the factorization.

With 20 topics, it was a pain to look through all of them to determine what each topic represented. Further, some topics were almost identical. One such example was a problem relating to finding the work in an adiabatic process. Using the first law of thermodynamics and recognizing the degrees of freedom were common to two topics. However, one topic had to be able to compute the work correctly, as the other one did not. This is probably an indication that the algebra leading up to finding the work was difficult for some. I tried to optimize these problems and ultimately chose 11 topics, which seems to work reasonably well.

Some "topics" are topics simply by virtue of being worth many points. This would be rubric items with entries such as "completely correct" or "completely incorrect." This tends to hide the finer details that in a problem (e.g. a question testing multiple topics, which is quite common in tests we gave). These topics often had a disproportionate number of points attributed to them. Apart from this, most topics seemed to have roughly the same number of points attributed to them.

Another unexpected feature was that I got a topic that negatively correlated with one's score. This was extremely counter-intuitive as in NMF each topic can only positively contribute to score, so having a significant component in a score necessarily means having a higher score. The reason this component exists is that it captures rubric items that almost everyone gets right. A higher scoring student will get the points in these rubric items from other topics that also contain this rubric item. Most of the other topics had high contributions from rubric items that fewer than 75% of students obtained.

Many topics were contained within a problem, but related concepts across problems did cluster as topics. For example, finding the heat lost in a cyclic process correlated with being able to equate heat in to heat out in another problem. However, it was more common for topics to be entirely contained in a problem.

The exam I analyzed was interesting as we gave the same exam to two groups of students, but had different graders grade the exams (and therefore construct different rubrics). Some of the topics found (like being able to calculate entropy) were almost identical across the two groups, but many topics seemed to cluster rubric items slightly differently. Still, the general topics seemed to be quite consistent between the two exams.


The plots show a student's aptitude in a topic as a function of their total exam score for three different topics. Clearly, depending on the topic the behaviors can look quite different.

Looking at topics by the student's overall score has some interesting trends as shown above. As I mentioned before, there are a small number (1 or 2) topics which students with lower scores will "master," but these are just the topics that nearly all of the students get points for. A little over half the topics are ones which students who do well excel at, but where a significant fraction of lower scoring students have trouble with. The example shown above is a topic that involves calculating the entropy change and heat exchange when mixing ice and water. This may be indicative of misconceptions that students have in approaching these problems. My guess here would be that students did not evaluate an integral to determine the entropy change, but tried to determine it in some other way.

The rest of the topics (2-4) were topics where the distribution of points was relatively unrelated to the total score on the exam. In the example shown above, the topic was calculating (and determining the right signs) of work in isothermal processes, which is a somewhat involved topic. This seems to indicate that success in this topic is unrelated to understanding the overall material. It is hard to know exactly, but my guess is that these topics test student's ability to do algebra more than their understanding of the material.

I made an attempt to assign a name to each of the topics that were found by analyzing a midterm (ignoring the topic that negatively correlated with score). The result was the following: heat in cyclic processes, particle kinetics, entropy in a reversible system, adiabatic processes, work in cyclic processes, thermodynamic conservation laws, particle kinetics and equations of state, and entropy in an irreversible system. This aligns pretty well with what I would expect students to have learned by their first midterm in the course. Of course, not every item in each topic fit nicely with these topics. In particular, the rubric items that applied to many students (>90%) would often not follow the general topic.

Ultimately, I was able to group the topics into four major concepts: thermodynamic processes, particle kinetics and equations of state, entropy, and conservation laws. The following spider charts show various student's abilities in each of the topics. I assumed each topic in a concept contributed equally to the concept.


Aptitude in the four main concepts for an excellent student (left) an average student (middle) and a below average student (right).


Since the data is structured to be positive and negative (points can be given or taken off), there may be other matrix decompositions that deal with the data better. In principle, this same analysis could be done using not the matrix of points, but the matrix of boolean (1/0) indicators of rubric items. This would also allow us to take into account the zero point rubric items that were ignored in the analysis. I do not know how this would change the observed results.

I had to manually look through the descriptions of rubric items that applied to each topic and determine what the topic being represented was. An exciting (though challenging) prospect would be to be able to automate this process. This is tricky, though, as associations that S and entropy are the same could be tricky. There may also be insights from having "global" topics across different semesters of the same course in Gradescope.

The code I used for this post is available here.

Resistors and Distance on Graphs

I feel bad for not having written a post in so long! I have been busy with teaching, research, and various other projects. Now that my teaching duties are done, I will try to post more regularly!

A graph is a collection of nodes that are connected by edges that are drawn under a defined condition. These edges may or may not have weights. Graphs are useful representations of data in many scenarios. For example, the Internet is an example of a graph. Each web page would be a node and an edge would be drawn between two nodes if one of the pages links to the other. These edges could be unweighted or could represent the number of links between the two pages. Another example is social media. For example, the nodes could be all users of the service and an edge would be drawn if two nodes are "friends" with each other.

One way to define a distance between two nodes on the graph may be to find the shortest path between the nodes. This would be defined as the collection of edges from one node to the other such that the sum of the weights of the edges (or simply the number of edges in the case of an unweighted graph) is minimized. This is what LinkedIn seems to do when they compute the "degree" of connection of a stranger to you. This is also how Bacon and Erdös numbers are defined.

Fig. 1: Two graphs with the same shortest path distance between A(lice) and B(ob). However, it is clear that in the right graph A(lice) seems better connected to B(ob) than in the left graph.

A shortcoming of this measure, though, is that shortest path ignores how many paths there are from one node to the other. This scenario is depicted in Fig. 1. Suppose in a social network, we would like to determine something about how likely it is that one person (say Alice) will meet another (say Bob) in the future. Suppose Bob is a friend of only one of Alice's friends. Then, given no other information, it seems unlikely that Alice would ever meet Bob, since there is only one avenue for Alice to meet Bob. Of course, this could be different if Bob was good friends with Alice's significant other, so we might want to weight edges if we have information about how close Alice is to her friends in the social network. Now, if Bob is friends with half of Alice's friends, it seems quite likely that when Alice goes to a party with her friends or is hanging out with one of those friends, then Alice will run into Bob. In both of these cases, the shortest path distance between Alice and Bob is the same, but the outcome of how likely they are to meet (which is what we want to analyze) is different.

It turns out that a useful analogy can be made by considering each edge as a resistor in a resistor network. In the unweighted case, each edge is a resistor with resistance 1 (or 1~\Omega if having a resistance without units bothers you, though units will be dropped for the rest of the post) and in the weighted case, each edge is a resistor with resistance equal to the inverse weight of the edge. We will see that the effective resistance between two nodes is a good measure of distance that more accurately captures the scenario described above. Groups of nodes with small effective resistance between them will correspond to clusters of people (such as people who work in one workplace) in the social network.

The effective resistance satisfies the triangle inequality, which is a defining property of distances [1]. We can see this as follows. The effective resistance between nodes a and b is the voltage between the two nodes when one unit of current is sent into a and extracted from b. Let the voltage at b be zero (we are always free to set zero potential wherever we like). Then R_{ab} = v_a = (v_a-v_c)+v_c. Now, we know that v_a-v_c \leq R_{ac}, since the potential should be maximal at the source (current doesn't climb uphill). Similarly, v_c\leq R_{cb} (remember node b is grounded). Thus, we have that R_{ab} \leq R_{ac}+R_{cb}, and the triangle inequality holds. The other defining characteristics hold quite trivially, so the effective resistance is a reasonable way to calculate distance.

Now consider the example presented before and depicted in Fig. 1. If Bob is only friends with one of Alice's friends, and there are no other links between Alice and Bob, then the effective resistance between Alice and Bob is 2. In this case, the effective resistance is the same as the shortest path distance. If Bob is friends with 7 of Alice's friends, and there are no other links between Alice and Bob, the effective resistance between Alice and Bob is 0.29. So, we see that this satisfies the property that having one connection is not as important as having many connections to Bob.

Another interesting consequence of this analogy is related to random walks. Suppose a walker starts at a node v, and chooses a random edge connected to that node to walk along (the random choice is weighted by edge weights if the graph is weighted). Then, the expected number of steps to start at v, go to another node w, and return back to v (which is called the commute time) is proportional to the effective resistance between v and w. One way to think about this is that a way to estimate effective resistance, then, on a large resistor network would be through the use of random walks [2]. This is interesting for tricky resistor problems such as the infinite grid of resistors. This also reinforces the idea that effective resistance is a good way of quantifying communities. When one node is well connected with another, it should be relatively easy to commute between the nodes, and thus they should be part of the same community. Further, this notion can be used to place bounds on the maximum possible commute time.

Effective resistance has many other uses as well. Quickly computing effective resistance turns out to be quite useful in speeding up an algorithm that solves the maximum flow problem [3]. In addition, sampling edges by weights proportional to the effective resistance across the edge yields an algorithm that sparsifies graphs. This means that the new graph has fewer nodes and edges than the original graph, but looks "close" to the original graph, and is a more efficient representation of that information [4].

1. Lau, L.C., 2015. Lecture notes on electrical networks.
2. Ellens, W., 2011. Effective Resistance.
3. Christiano, P., et al., 2010. Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs.
4. Spielman, D. and Srivastava, N., 2011. Graph Sparsification by Effective Resistances. SIAM J. Comput., 40(6), 1913–1926.

Magnetic fields in Physibounce

One of the things that has frustrated me about Physibounce is that when the magnetic field is on, the radius of the cyclotron motion of the particle isn't constant. Here I will write about why this problem occurred and what can be done to fix it.

In Physibounce, the magnetic fields are constant in time (other than turning on or off). When the magnetic field is turned on, the magnetic field can either point into or out of the screen. Let me call this the \hat{z} direction. The equations of motions for the particles in Physibounce are

 \frac{d\vec{v}}{dt} = \frac{\vec{F}}{m}+\omega_c \vec{v}\times\hat{z}

 \frac{d\vec{x}}{dt} = \vec{v}.

where \omega_c = \frac{eB}{m}, with B is the magnitude of the magnetic field, e is the charge of the particle, and m the mass. F is all forces acting on the particle that are not the magnetic field. Before now, Physibounce used an semi-implicit Euler method. Let us see why the semi-implicit Euler method is bad. For simplicity, take the case when there are no forces other than the magnetic field applied. The update equation for semi-implicit Euler is

 \vec{v}(t+\Delta t) = \vec{v}(t) +\omega_c\Delta t(v_y \hat{x}-v_x\hat{y})

 \vec{x}(t+\Delta t) = \vec{x}(t) + \vec{v}(t+\Delta t)\Delta t.

Assuming I measure distances from the center of the circular motion of the particle, ideally the radius of the orbit does not change. Since the magnetic force is causing the circular motion, it must be that m|\vec{v}(t)|^2/|\vec{x}(t)| = e |\vec{v}(t)| B, or equivalently, |\vec{v}(t)| = \omega_c |\vec{x}(t)|. Thus, enforcing that the radius of orbit does not change is equivalent to enforcing that kinetic energy is conserved as \frac{1}{2} m |\vec{v}(t)|^2 = \frac{1}{2} m \omega_c^2|\vec{x}(t)|^2 (remember that magnetic fields do no work). Consider the velocity squared at the updated timestep

 |\vec{v}(t+\Delta t)|^2 = |\vec{v}(t)|^2 + \omega_c^2\Delta t^2 |\vec{v}(t)|^2.

Thus, every timestep, a mistake of \frac{1}{2} m|\vec{v}(t)|^2\Delta t^2\omega_c^2 is made to the kinetic energy. Since this is proportional to the current kinetic energy, this means that the error gets worse and worse as time goes on, which is why semi-implicit Euler is bad for modeling magnetic forces. This is also why in Physibounce the radius of the orbit increases for a particle in a magnetic field.

One solution to the problem would be to use polar coordinates. Then the semi-implicit Euler update would be

 r(t+\Delta t) = r(t)

 \theta(t+\Delta t) = \theta(t) + \omega_c \Delta t.

which looks quite simple. In this form, it is obvious that the radius of rotation is conserved. However, if I wanted to use this method, I would have to write all of the other forces of Physibounce in polar coordinates. This would be quite a bit of work to work out, so I did not choose this method.

The solution to this problem while still using Cartesian coordinates is to use a leap-frog-like scheme, specifically one called the Boris pusher. The idea of the Boris scheme is to push the velocity by a half timestep using only the force \frac{F}{m}, then make a rotation due to the magnetic field, and then push the velocity another half timestep using the same force as before [1]. As before, the position is updated with the updated velocity. I calculate the following quantities

 \vec{v}^- = \vec{v}^{n} + \frac{\vec{F}^n}{m}\frac{\Delta t}{2}

 \vec{v}^+ = \vec{v}^{-} + \frac{\omega_c \Delta t}{1+\omega_c^2\Delta t^2/4}(v^{-}_y\hat{x}-v^-_x\hat{y})-\vec{v}^-\frac{\omega_c^2 \Delta t^2}{2(1+\omega_c^2\Delta t^2/4)}

Then I update

 \vec{v}^{n+1} = \vec{v}^+ + \frac{\vec{F}^n}{m}\frac{\Delta t}{2}.

I will again consider the case when the magnetic field is the only force, but the more general case is considered in [1]. In this case, the update has the nice property that

 |\vec{v}(t+\Delta t)|^2 = |\vec{v}(t)|^2.

Thus, kinetic energy is conserved, and hence the radius of orbit of the particle is constant. This creates a more realistic simulation and does not make the computation of all the other forces in Physibounce any more complicated. There is a minimal amount of code required to incorporate it into a simulation already using a semi-implicit Euler method. These pluses make it ideal for solving the problem in Physibounce.

The Boris pusher is widely used in plasma physics simulations as it is quite fast (in terms of number of flops required) and represents the dynamics without too much error. It is used in N-body and PIC simulations.

I have updated Physibounce (first update in about a year!) to reflect this.

1. Qin, H. et al., 2013. Why is Boris algorithm so good? Phys. of Plas. 20-8.

Pedestrian dynamics

I've written about modeling the movement of cars as a fluid in the past. We could think about pedestrians like this, but usually pedestrians aren't described well by a fluid model. This is because while cars are mostly constrained to move in one direction (in lanes), this is not true of pedestrians. On a sidewalk, people can be walking in opposite directions, and often someone walking in one direction will directly interact (by getting close to) someone walking in the opposite direction. There are some specific scenarios where a fluid model could work, such as a crowd leaving a stadium after the conclusion of a basketball game. In this case, everyone is trying to get away from the stadium so there is some kind of flow. However, this doesn't work generally, so I will consider a different type of model, similar to the one described in [1].

If there are only two directions that people want to travel and they happen to be opposite, then we could model the pedestrians as charged particles. The pedestrians that want to go in opposite directions would be oppositely charged, and the force that keeps the pedestrians on a trajectory could look like an electric field. However, this would mean that people moving in opposite directions would attract each other, which really does not match expectations. This model also fails if there are multiple directions where pedestrians want to go (such as at an intersection), or if the desired directions of the pedestrians are not opposite. While a plasma (which is a collection of charged particles) model may not be the best to describe the scenario, I will borrow some ideas from plasma dynamics in building my model and I will use techniques used to simulate plasmas to simulate pedestrian movement.

There will be a few effects governing the movement of pedestrians. One effect is for the pedestrians to want to go a desired direction at a desired speed. It turns out that most humans walk at a speed of around v_d=1.4 m/s (3.1 mph) and if someone is going slower or faster than this, they will tend to go toward this speed. Let me call the desired speed along the desired direction of the pedestrian \vec{v}_d, and the current walking speed of the pedestrian \vec{v}. I will model the approach to the desired direction as a restorative force that looks like

 \vec{F}_{restore} = \frac{\vec{v}_d-\vec{v}}{\tau}.

Here, \tau represents how long it takes the pedestrian to get back to their desired position and direction once they are off track. In general, \tau could be different for every pedestrian, but for simplicity I set it as a constant for all pedestrians here, and I will take it to 0.3 s, which is close to the human reaction time. Note that the restorative force is zero when \vec{v_d}=\vec{v}, so if a pedestrian is already going in their desired direction at their desired speed, there will be no restorative force and the pedestrian will continue to go at this direction and speed. You may find it odd that my force has units of acceleration. I am thinking about this more as a generalized sense of the term force as in something that causes velocity changes, but it would also be reasonable to assume that I have set the mass of the pedestrians to 1.

Pedestrians will also avoid colliding with each other, which is the other force I include in the model. While [1] assumes an exponential force for the interaction force, I will assume that pedestrians interact via a generalized Coulomb potential. The general results seem to match without too much regard for the exact shape of the force. I define the force between pedestrian i and pedestrian j is

 \vec{F}_{ij} = \gamma \left(\frac{|\vec{r}_{ij}|}{r_0}+\epsilon\right)^{-\alpha}\hat{r}_{ij}.

Where \vec{r}_{ij} = \vec{r}_i - \vec{r}_j. \gamma, \epsilon, \alpha, and r_0 are constants that I will describe below. r_0 is an interaction radius that sets the scale for this interaction. This would not necessarily be the same for everyone. For example if someone is texting, their interaction radius r_0 is probably much smaller than someone who is paying attention to where they are going. However, for simplicity I take it to be the same for everyone, and I take it to have a value of 1.2 m.

Since pedestrians travel in 2 dimensions, if \alpha = 1 and \epsilon = 0, this would be the Coulomb potential, if \gamma were aptly chosen. In this scenario, however, I do not really want the Coulomb potential. The Coulomb potential is quite long range, meaning that particles in a Coulomb potential can influence particles that are quite far away. As the power \alpha in the equation above gets larger, the force becomes more short-range, which seems to better model the interactions of pedestrians. However, this presents another problem in that the force gets extremely large if two pedestrians happen to get really close to one another. To combat this, \epsilon is a small number that "softens" the force such that the force never gets extremely large (which I took to mean |\vec{F}_{ij}| should never be too much bigger than the maximum possible value of |\vec{F}_{restore}|). \gamma then decides the relative importance of this interaction force to the restorative force.

I will simulate this model by considering N people are in a long hallway with aspect ratio 1:10, for example at an airport or a train station. This can also be a model for a long, wide sidewalk as even though there are no walls, people are relatively constrained to stay on the sidewalk. I have some people trying to get to one end of the hallway (in the +\hat{x} direction) and some trying to get to the other end (in the -\hat{x} direction). This is an example of an N-body simulation, which is widely used in studying gravitational systems and plasma systems.

In [1], the walls exerted an exponential force on the pedestrians. I choose a similar model. I set the parameters of the exponential empirically such that the pedestrians keep a reasonable distance from the walls. I set the range of the exponential force to be a tenth of the total width of the corridor. I set the magnitude such that at the maximum, the force due to the wall is the same as the maximum value of |\vec{F}_{restore}|.

When a pedestrian reaches the end of the corridor, I record how much time it took for that pedestrian to traverse the corridor. I then regenerate the pedestrian at the other end of the corridor as a new pedestrian. I generate the new pedestrian with a random y coordinate and a random velocity direction, but pointing at least a small bit in the desired direction. The magnitude of the velocity is taken to be v_d. Thus, the simulation is set up such that there will always be N people in the hallway.

A simulation of N=100 people in a hallway of dimensions 100 m x 10 m. All pedestrians desire to go to the left of the hallway. The pedestrians relax to a state where they are each about the same distance from each other. It seems that people usually stand closer together on average, so our value of r_0 should probably be smaller to match observations.

The first thing I tried was to simply put a few people in the hallway all wanting to go in the same direction, and see what they do. I set the length of the hallway to be 100 m, which made the width of the hallway 10 m. As can be seen above, this isn't too exciting. The pedestrians' paths are mostly unobstructed and they get across the hallway in about 71 s, which is the length of the hallway divided by 1.4 m/s, the comfortable walking speed of the pedestrians. Even in this simple case, though, it is apparent that the pedestrians "relax" into a scenario where the average distance between the pedestrians is roughly the same.

A simulation of N=100 people in a hallway of dimensions 50 m x 5 m. Pedestrians are equally likely to want to go left or right. We can see that lanes of people that would like to go in the same direction can form, as was observed in [2]. This effect could be even stronger with an extra "incentive force" for people to be on the right side of the road if they are not already on that side.

The time required to cross the room as a function of density of people in a simulation of N=100 people. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). p=0.5, which means half of the pedestrians desire to go to the left and the other half desire to go to right. I change the density of people by changing the size of the room from 2.5 m x 25 m to 10m x 100 m. As the density is higher, the pedestrians interact more with each other and thus are less likely to be on their desired trajectory.

Next, I looked at the more interesting cases of what happens when there are pedestrians that want to go in different directions. First, I assume that exactly half of the pedestrians would like to go in one direction and half would like to go the other direction. I then varied the length and width of the hallway, keeping the aspect ratio constant, while keeping the number of people in the hallway constant at 100. This has the effect of changing the density of people in the hallway. The y-axis on the graph above is normalized by L/v_d, which is the time a pedestrian with all of his or her velocity in the desired direction would take. This shows that as the density increases, it takes longer (proportionally) for the pedestrians to get across the room. This makes sense as the pedestrians are interacting more often and thus cannot keep going in the desired direction.

A simulation of N=100 people in a hallway of dimensions 50 m x 5 m. 90% of pedestrians want to go left, while the other 10% want to go right. The right-going pedestrians undergo many interactions with the left-going pedestrians. In fact, if the left-going pedestrians were denser, this could look like Brownian motion.

The time required to cross the room as a function of p, the fraction of N=100 people that would like to go left or right. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). The size of the room is 5 m x 50 m. The blue line is the time to get across for the pedestrians going leftward, and the red line in the time to get across for the pedestrians going rightward. As the fraction of pedestrians going leftward increases, it becomes easier for those pedestrians to get across, but it makes it harder for the pedestrians that would like to go in the opposite direction to get across more slowly.

I then took the number of people in the hallway to be 100 with the length of the hallway being 50 m and the width being 5 m. I observed what happened as I varied the fraction of pedestrians, p that wanted to go in either direction. This effect is shown above. As p is increased, the more dominant pedestrians can get through the corridor more quickly than the less dominant pedestrians. Again, this makes sense as when people go "against the gradient," they have to weave through people to try to get to the other side.

I will note that I have not done this simulation in the most efficient way. For every pedestrian, I calculate the interaction force with all the other pedestrians and add up all the contributions. It turns out one can average or sometimes even ignore the effect of pedestrians far away, which can make the code run about 1/N times faster.

The python and gnuplot scripts I used for the simulation and to create the plots are available here.

1. Kwak, J., 2014. Modeling Pedestrian Switching Behavior for Attractions. Transportation Research Procedia. 2. 612-617.
2. Tao, X., 2011. A Macroscopic Approach to the Lane Formation Phenomenon in Pedestrian Counterflow. Chinese Phys. Lett. 28.

Entropy and least squares

While most of the papers I reference in this blog are relatively new, this article will discuss an old idea, described in some detail in [1]. Entropy has been used as a concept in information theory for over half a century. One definition of the thermodynamic entropy is

 S = - k_B \sum_i p_i \ln(p_i).

Here k_B is the Boltzmann constant and p_i is the probability that the system is in the ith microstate. The sum is over all of the microstates, which is one particular configuration for the system. For example, if I were calculating the entropy of a gas molecule in a box, a microstate would be the gas molecule with a particular position and momentum.

Since this is an equation with probability in it, it seems natural to extend the idea to other fields that deal with probability. This is the idea behind the Shannon entropy of information theory and statistics. This entropy is defined in a similar way. If one has a discrete probability distribution, p(x_i), then the Shannon entropy is

 S = - \sum_i p(x_i) \log_2(p(x_i)).

Here the sum is over all possible values of x_i.

The second law of thermodynamics states than an isolated system will always try to maximize its entropy. Thus, if I want to determine what the equilibrium configuration of a system is, I can do this by maximizing the entropy of the system. This is the approach [2] takes in "deriving" thermodynamics from basic statistical mechanics. The entropy is maximized subject to the constraint that the distribution should have a certain energy or particle number, and the Lagrange multipliers enforcing the constraints turn out to be related to temperature and chemical potential.

A similar idea can be applied to statistics. Suppose we would like for a probability distribution to have average \mu and standard deviation \sigma, but have as little other information encoded as possible. In other words, given a set of data points x_i, I would like to find p(x_i), the assignment of probability to each of the x_i, such that the average x_i value will be \mu and the standard deviation is \sigma^2. I would do this by maximizing the Shannon entropy (by setting dS = 0) subject to some constraints. These constrains are

 \sum_i p(x_i) = 1

 <x data-recalc-dims= = \sum_i p(x_i) x_i = \mu" />

 <x^2 data-recalc-dims= = \sum_i p(x_i) x_i^2 = \sigma^2 + \mu^2." />

By setting dS=0, the form of the equation for p(x_i) becomes

 p(x_i) = e^{-\lambda_1-\lambda_2 x_i-\lambda_3 x_i^2}.

Here, the \lambda_j are the Lagrange multipliers. I then plug this form into the constraint equations to solve for the values of \lambda_j. This gives

 \lambda_1 = \ln(Z)

 \mu = -\frac{\partial}{\partial \lambda_2} \ln(Z)

 \sigma^2+\mu^2 = -\frac{\partial}{\partial \lambda_3} \ln(Z)

where Z, the partition function, is

 Z = \sum_i e^{-\lambda_1-\lambda_2 x_i-\lambda_3 x_i^2}.

So I see the partition function is a useful object in the statistical sense as well. We can't simplify this any further without knowing specific values for x_i, but given this information it would be easy to solve for the values of the Lagrange multipliers.

This procedure produces some reasonable results. For example, if the set of points is the entire real axis and I would like to apply the constraints above (though I have to do things a little differently since this is a continuous case), the distribution this procedure gives turns out to be a Gaussian. Thus, the Gaussian is a distribution over the whole real line that has a set average and standard deviation but encodes as little other information as possible.

There is a notion of "relative entropy" that may not be as familiar to physicists (at least I had never heard of it). This is called the Kullback-Leibler (KL) divergence. This can be quantified as (this is actually the negative of the KL divergence)

 B(P;Q) = -\sum_i P(x_i) \ln\left(\frac{P(x_i)}{Q(x_i)}\right).

The KL divergence of a distribution P from a distribution Q quantifies how much information is lost when Q is used to approximate P. This seems like a nice thing to consider in the context of regression. I will follow [3] to use this to show how to compare two fit models and determine which one is more robust.

Let me assume there is some true distribution f(y) and I am approximating it by a function g(y|x). Now consider the expected entropy in x (here E_x will denote the expectation value with respect to x). This is

 E_x B(f;g(|x)) = - E_y \ln(f(y)) + E_xE_y\ln(g(y|x)).

Now suppose there were another model, g'(y|x). I would like to consider whether g or g' describes f better. I can look at this by looking at the difference in the expected entropy of the two.

 E_x B(f;g(|x))-E_x B(f;g'(|x)) = E_x(E_y\ln(g(y|x))-E_y\ln(g'(y|x)))

I have made a measurement of \ln(g(y|x))-\ln(g'(y|x)) by performing the fit, as this is a ratio of log-likelihoods. Asymptotically, (look at [3] for details) this measurement will differ from the expected value by 2(k-k'), where k and k' are the number of parameters used in the fit of g and g', respectively. Correcting for this bias, the difference in entropies is

  E_x B(f;g(|x))-E_x B(f;g'(|x)) = 2\ln(L)-2k-(2\ln(L')-2k')

where L and L' are the likelihood functions of the model g and g'. Thus, while the logic was a bit complicated, the difference in entropies shows us an amazingly simple result! All that I need to do to compare the quality of two models is to look at the difference of twice the log-likelihood and the number of fit parameters. This is the Akaike information criterion (AIC). The AIC is an extremely useful metric to decide whether a model is being over- or under-fitted!

1. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics. Physical Review 106-4, 620-630.
2. Pathria, R.K., 1972. Statistical Mechanics.
3. Akaike, H, 1985. Prediction and Entropy. A Celebration of Statistics 1, 1-24.

Ising model and voting patterns

The Ising model is a ubiquitous model in condensed matter physics as it is a simple model with nontrivial behavior. It is one of the few systems that has analytic solutions, though this only holds for the Ising model in low dimensions. Despite its simplicity, it does quite well at describing the ferromagnetic systems.

While the Ising model is typically applied to ferromagnetic spin systems, even within physics it can also be used to describe a gas that lives on a lattice. Here, I will show how the model can be extended outside of physics and describe how the Ising model relates to voting.

Assume there are N people that will vote in an election. For simplicity, assume there are just two candidates in the election (or the vote is for a yes/no proposition). Let the state of voter i be characterized by a variable s_i that take the value of +1 or -1, corresponding to voting for one of the two outcomes of the election. The assumption is that each voter will interact with other voters and convince them that they should vote a certain way. Let the energy associated with a voter agreeing with someone they interact with be -\epsilon and the energy associated with a voter disagreeing with someone they interact with be +\epsilon.

For each person, there will be a propensity to vote a certain way, such as being affiliated with a political party or being subjected to advertising campaigns. Let the energy of this propensity be called h_i for person number i. Let h_i be measured on a scale of -\beta (always voting for candidate -1) to +\beta (always voting for candidate +1). If a voter has a 0 value for this propensity, they are a "swing voter" and can be convinced by his or her neighbors to vote for either the +1 or -1 candidate. In a physical system, this propensity is equivalent to the existence of an external magnetic field in the spin system. Putting all this together, the total energy among the N voters is

 E = -\epsilon \sum_{<ij data-recalc-dims=} s_i s_j+\sum_i h_i s_i." />

The angled brackets here mean only take the sum if i and j are able to interact. This is in here as clearly all N voters will not talk to every one of the others unless N is very small. This equation is precisely the equation for the Ising model. The equilibrium state will be one when this energy is minimized (though this minimum energy state is usually not unique). Now suppose all of the voters are on a 2D lattice (or grid), meaning that each voter interacts with 4 other voters. In the \beta=0 case (no political parties), then an analytic solution exists.


Configuration of voters on a 2D lattice with \beta=0. Black are voters for candidate +1 and white are voters for candidate -1. The left is the case where \frac{\epsilon}{T} = 0.5, the right is where \frac{\epsilon}{T} = 0.33. There is clearly clustering of similarly minded people on the left, but there is no such structure on the right. This is an example of a phase transition.

As shown above, in this case, depending on whether the "social temperature" is above or below a critical temperature, the voters will have random opinions or will form large regions of agreement. Note that as before, this temperature does not correspond to how hot it is where the voters live, but a measure of how nonideal voters are willing to be with regards to the model. If the temperature is high, random fluctuations dominate and so there is no correlation among voters. However, if the temperature is low, the voters tend to form regions where they agree with others. The critical temperature can depend on the issue at hand. I tend to agree with most of the people I associate with, so in my experience it seems that voting systems are usually in the low energy limit. However, there are some issues that are not as clear (such as GMO labeling) where the voting system can be in the high energy limit.

Now let me consider the effect of political parties. This case cannot be solved analytically, so the only way to solve the problem is numerically. The best way to do this is using the Metropolis algorithm. In this method, we first assign randomly to all voters which candidate they will vote for (give them +1 or -1). We then choose a voter at random and consider the energy change \Delta E if we changed that voter's vote. If switching the vote will decrease the total energy of the system, then the voter's vote is changed. If not, the voter's vote is changed with probability e^{-\Delta E/T}, where T is the temperature of the system. If T=0, this second condition never occurs and thus a voter's opinion is changed only if it decreases the total energy. Thus, as mentioned earlier, a nonzero temperature has the effect of measuring how nonideal voters are willing to be which corresponds to increasing the energy. This probability is called a Boltzmann factor and is ubiquitous in statistical physics. After this is done, another voter is chosen randomly and the process is repeated many times. After a long time, this algorithm produces a state that is typical of an equilibrium configuration of the voters.

I assume my voters live on a 2D lattice. This means that each voter interacts with 4 other voters and no one else. One reason I choose this is that it is particularly easy to visualize. I chose the number of voters to be 62500. I repeated the Metropolis algorithm 10^7 times, so that every voter will probably be considered by the algorithm many times. I took periodic boundary conditions such that voters on the edge interact with voters on the other edge. This insured that every voter interacted with 4 other voters.

Configuration of voters on a 2D lattice with \frac{\epsilon}{T} = 0.5 and \frac{\beta}{T}=2. The h_i values for each voter were assigned randomly between -\beta and +\beta. Black are voters for candidate +1 and white are voters for candidate -1. Unlike the \beta=0 case, there is now less clustering of like-minded voters.

The first thing I tried is shown above. I randomly assigning values of h_i between -\beta and +\beta to each of the voters. I still assign randomly the initial configuration of votes. This has the effect of adding more disorder to the voters, and can prevent clusters of similarly minded voters from forming, even at low temperatures.


Configuration of voters on a 2D lattice with \frac{\beta}{T}=0.5, with \beta distributed from negative to positive linearly from one end of the lattice to the other. Black are voters for candidate +1 and white are voters for candidate -1. The left is the case where \frac{\epsilon}{T} = 0.5, the right is where \frac{\epsilon}{T} = 0.33. There is a tendency in both images for votes to cluster toward one side.

This, however, doesn't seem too realistic, as like-minded people usually associate with each other. Because of this, I tried adjusting h_i such that it varies linearly from one edge of the lattice to the other. This has the effect of encouraging order, and predictably groups similarly minded people (when \beta=0, the choice of grouping is random and is an example of spontaneous symmetry breaking).

When the choice of h_i is such that positive and negative values are equally likely, there is no clear winner in the election (the difference in +1 and -1 votes is usually very close to 0). However, if there is an asymmetry between the number of people who can have negative h_i values and positive h_i values, this is usually a good indicator of who will win the election. This asymmetry could be due to something as simple as one campaign having better ads than the others and getting the swing voters.

One extension that could be made is to have different weights for certain interactions. Reading a distant friend's post on Facebook is probably less relevant to you than a discussion with a close family member. To account for this, the value of \epsilon could be adjusted such that it takes on different values for different pairs of interactions. Another possible extension is to consider different interaction schemes. It seems unlikely that every voter would have only four people they interact with. We could do something more realistic such as having a range of a number of interactions each voter can have. This is a little more complicated to program, however, as a different data structure (a graph) has to be used.

You can see the code I used here.

1. Schroeder, Daniel. 2000. An Introduction to Thermal Physics.

City Population and the Fokker-Planck equation

Zipf's law states that many types of data, when sorted, look like a Pareto distribution and when plotted on a log-log scale look like a straight line. In other words, many features tend to have power law behavior. This is simply described by the Pareto principle which states that often a small percentage of causes account for a large percentage of effects. This is exemplified by the 99% slogan used by the Occupy movement stating that the top 1% of people in the U.S. earned most of the income.

Zipf's law seems to be quite ubiquitous. Examples where similar trends have been observed are the number of crimes criminals commit, the amount of healthcare patients receive, the amount of complaints customers give, and many others [1]. This article is about the distribution of people across cities, which also follows Zipf's law.

[2] and [3] look at Zipf's law for wealth distribution. Here, I'll present the same basic argument they do, but in terms of population, as this seems a little more intuitive to me.

The above is a plot of the population of U.S. cities with more than 50,000 people by their rank in size. The data is from [4]. The orange line is the linear (on a log-log scale) fit. As you can see, the line fits the data quite well, especially toward the smaller cities. The slope of the line is about 0.725.

Now I will consider the mathematical model. Say there are \mathcal{N} cities. The only assumption that I will make is that people can move between two cities following a simple rule. I will assume that two cities can move some (presumably small) constant \beta times the minimum of the population of the two cities. I assume that the flow is random, such that this movement can occur either way.

Now assume two cities have population n and n'. The change, \Delta, in population of the city with population n after someone moves is

 \Delta = \beta r \min(n,n').

The change in population of the city with population n' is -\Delta. r is a random variable that has a 50% chance of being +1 and a 50% chance of being -1. The PDF, P(n,t) of a system undergoing a Markovian process of small steps like this is given by the Fokker-Planck equation.

 \frac{\partial P}{\partial t} = - \frac{\partial}{\partial n}\left(<\Delta data-recalc-dims=P\right)+\frac{\partial^2}{\partial n^2}\left(\frac{<\Delta \Delta>}{2}P\right) = \beta^2 \frac{\partial^2}{\partial n^2}\left(\left(\frac{n^2}{2}A(n)+B(n)\right)P\right)" />


 A(n) = \frac{1}{\mathcal{N}} \int_n^\infty dn' P(n'),

is the Pareto function and

 B(n) = \frac{1}{\mathcal{N}} \int_0^n dn' P(n')\frac{{n'}^2}{2}.

The angled brackets here represent expectation values. For a derivation of this relation, see [3]. In plasma physics, if the plasma is sufficiently dense, the constituents of the plasma will have many interactions with one another. If one particle in the plasma is looked at, the path looks like a simple random walk, which is a Markovian process. Thus, the PDF of the plasma will obey what looks like a Fokker-Planck equation. Similar arguments can be used to obtain a similar relation for the PDFs of particles undergoing Brownian motion. The Fokker-Planck equation has even found uses in mathematical finance.

One thing to note is that the value of \beta cannot change the dynamics. I can change my units of time to absorb \beta, and so nothing critical should depend on it (and I do not include \beta when I do the numerical integration). The PDE is not easy to solve analytically as it mixes up derivatives and integrals of the PDF P(n,t). Thus, I solve it numerically by assuming an initial distribution for P(n,0) and letting it evolve in time until it seems like the PDF has reached a steady state. The initial distribution I assumed was exponential.

The population distribution of the model. Given the blue initial conditions, the system evolves to the purple line which looks linear over a large number of ranks. The straight orange line has slope 1.05. I believe that the kink that develops is a numerical effect and could be reconciled if the step size in space were smaller. However, because of the CFL condition this requires an even smaller time step and thus would make the code take an even longer amount of time.

Here, it is apparent that the PDF approaches a straight line in the log-log plot, as was apparent in the U.S. census data. It turns out that the choice of the initial condition does not matter much. In [2] the authors choose a much different initial condition but obtain a similar steady state solution after a long time.

One key difference between the result of this model and the U.S. population is that the slope of the lines are quite different. In the plot above, the slope of the line is about 1.05, while the U.S. population has a slope of about 0.725.

One way to modify the model to potentially reconcile this is to allow for moving to and from a city to have different probabilities. Intuitively, I'd expect more people would want to go from a smaller city to a bigger city rather than the other way around. It turns out this effect is pretty easy to incorporate. In the relation above for \Delta, I change r such that it is a random variable that has an a chance of being +1 and a 1-a chance of being -1. Before, I took a=1/2.

In this case, <\Delta data-recalc-dims=" /> is no longer zero but the other term is exactly the same. The expression becomes

 \frac{\partial P}{\partial t} = - \beta(2a-1)\frac{\partial}{\partial n}\left((C(n)+nA(n))P\right) + \beta^2 \frac{\partial^2}{\partial n^2}\left(\left(\frac{n^2}{2}A(n)+B(n)\right)P\right)


 C(n) = \frac{1}{\mathcal{N}} \int_n^\infty dn' P(n')n',

Note that now \beta is almost a relevant parameter as it can no longer be completely absorbed into the definition of time. The relevant parameter here that will control the dynamics will be \frac{2a-1}{\beta}.

The population distribution of the modified model with \frac{2a-1}{\beta} = 0.2. Given the blue initial conditions, the system evolves to the purple line which looks linear over a large number of ranks. The straight orange line has slope 0.8.

I solved this numerically in the same manner as before. In the above plot, the slope of the line is about 0.8, which is much closer to the population data. All this required was for moving to a big city to be about 10% more desirable than moving away from the city (though the exact percentage depends on the relationship between a and \beta). I'm sure by tuning the ratio \frac{2a-1}{\beta} further I could get it even closer to the 0.725 slope observed in the U.S. population.

There are also other factors that could be considered, such as the birth and death of city residents as well as immigration. [2] describes how to take these into account, but for the sake of simplicity I will not consider it here. An interesting aspect is that including just one of these cannot change the behavior (other than the size of the population increasing or decreasing), though adding both can lead to different behavior.

As I was writing this, I was trying to think whether something like crime could be modeled as a similar way, but I couldn't think of a reasonable exchange (like the \Delta expression) that could be written. It would be interesting to look into how Zipf's law is emergent in other systems as well.

The code is available here. The code runs pretty slowly, it has not been optimized. The easiest optimization would probably be to do the computations on a GPU using pyCUDA. I'll leave that as an exercise to the reader ;).

1. Reed, William J.; et al. 2004. The Double Pareto-Lognormal Distribution – A New Parametric Model for Size Distributions. Communications in Statistics – Theory and Methods 33 (8), 1733–1753.
2. Boghosian, Bruce M. 2014. Kinetics of wealth and the Pareto law. Physical Review E 89, 042804.
3. Boghosian, Bruce M. 2014. Fokker-Planck Description of Wealth Dynamics and the Origin of Pareto’s Law. arXiv.