# Optimization Functions in Deep Learning

This will be a short post about optimization functions in deep learning. For simplicity, I'll show how these optimization functions work with a simple 1D example. Suppose that the loss function, $L,$ as a function of weight, $w,$ looks like the following.

The flat region for values of the weight between 1 and 2 is a little contrived, but this will help depict some of the advantages of the more complicated optimization functions. Saddle points are ubiquitous in a deep learning training landscape, so these are valid concerns to consider.

With the standard gradient descent described in a previous post, the optimization needs to entirely avoid the flat region. In this region, the gradient is zero, and thus no update to the weights can be made. As shown below, for positive starting values of the weight, it is hard to avoid the zero gradient region. Even a negative starting value can end up in the flat region if the learning rate is large enough.

Loss as a function of optimization steps starting with a weight of -10 (top) and a weight of 10 (bottom) for different learning rates. Note that the loss flattening out at $10^0$ is an artifact of the flat region of the loss function.

Decaying the learning rate as the optimization time increases also doesn't really help here. The biggest problem is the algorithm will get stuck in the flat region, and changing the learning rate won't fix this issue.

Momentum

This example shows that the crux of the problem is that the update goes straight to zero when a flat region is encountered. The idea behind momentum is to track a momentum variable that updates with each step.

In these expressions, a subscript denotes the vector component to allow the expressions to generalize to multiple dimensions. There is now a new parameter, $\alpha,$ that tracks how much memory of previous gradients the optimization process should have. Note that when $\alpha = 0,$ this reduces to standard gradient descent.

Implementing this on the function above, we see now that as long as the momentum parameter is large enough, the flat region can be avoided, even with the same learning rate as before. On top of that, the convergence is far faster than any of the choices above.

Loss as a function of optimization steps starting with a weight of 10 for different momentum parameters.

RMSprop

Typical deep learning models have many orders of magnitude more parameters (100000+). Suppose that during the training process, the gradients start out large for some of these variables but not others. There will be a significant update for those variables with large gradients. The gradients of the other parameters might become larger as the training proceeds. However, if the learning rate is decaying through the training process, these updates will not change the parameters as much as if these large gradients had appeared earlier in the training process.

Thus, RMSprop attempts to tune the learning rate for each parameter. If the gradient over the previous steps has been large for that parameter, the learning rate for that parameter will be smaller. This allows for smaller steps to refine the value of the parameter, as discussed in a previous post. Conversely, if the gradient over the last few steps has been small, we may want a larger learning rate to be able to move more in that parameter. We don't want the memory of the gradient sizes to be too long since it seems perverse for the first gradients to affect the 100000th step of training. Thus, a moving average is used to keep track of the gradient magnitudes. The updates of RMSprop are the following.

We see that $\sqrt{r_i}$ is a moving average of the gradient magnitude for each parameter. The moving average is characterized by the parameter $\rho.$ $\delta$ is a parameter chosen such that denominator in the fraction does not diverge. $\eta,$ as before is the learning rate of the problem. Global learning rate decay can also be combined with this method to ensure learning rates for each parameter are bounded.

Since the update is always directly proportional to the gradient, for the problem at hand, RMSprop doesn't help and will get stuck much like traditional gradient descent.

If momentum works well, and RMSprop works well, then why not combine both? This is the idea behind Adam. The gradient descent updates of Adam are the following.

The two moving averages, corresponding to the momentum and gradient magnitude, are tracked and used for the update. Adam is my go-to optimizer when training deep learning models. Typically the only parameter that needs tuning is the learning rate, $\eta,$ and the default values of the other parameters ($\delta=10^{-8},\alpha=0.9,\rho=0.999$) tend to work really well. I applied Adam to the 1D example loss function with these default parameters and got some decent results. This looks a little worse than momentum, but this is because this problem is so simple. For most real-world problems Adam outperforms momentum.

Loss as a function of optimization steps using Adam optimizer with default parameters starting with a weight of 10 for different learning rates.

# Amtrak and Survival Analysis

I got the idea for this blog post while waiting ~40 minutes for my Amtrak train the other week. While I use Amtrak a handful of times a year, and generally appreciate it I do find it ridiculous how unreliable its timing can be. (This is mostly due to Amtrak not owning the tracks they use, but I will not get into that here). Most of the delays I've experienced lately haven't been too bad (40 minutes is on the high end), but when I lived in North Carolina and was often taking the Carolinian train from Durham to Charlotte, the story was quite different. I can't remember a single time when the train arrived on time to Durham, and often the delays were over an hour.

This brings me to the point of this post, which is to answer the question, when should I get to the Durham Amtrak station if I want to catch the Carolinian to Charlotte? I'll assume that real-time train delay data isn't available so that past information is all I have to go off of. Certainly if all the trains are actually an hour late, I might as well show up an hour past the scheduled time and I would still always catch the train. Amtrak claims the Carolinian is on time 20-30% of the time, so presumably showing up late would make you miss about this many trains.

Fig. 1: Delay of arrival of the 79 train to the Durham station for each day since 7/4/2006 (with cancelled trains omitted). Note that in the first year and a half of this data, there are almost no trains that arrive on time, but the situation has improved over the years.

All Amtrak arrival data since 7/4/2006 is available on this amazing website. I got all the data available for the 79 train arriving to the Durham station. I've plotted the arrival times during this time in Fig. 1.

A simple frequentist approach

I can consider each train trip as an "experiment" where I sample the distribution of arrival times to the Durham station. The particular train I take is just another experiment, and I would expect it to follow the available distribution of arrivals. Thus, the probability of me missing the train if I arrive $\tau$ minutes after the scheduled time is

Where $N(t>t')$ counts the number of arrivals in the available data where the arrival $t'$ is greater than the specified $\tau.$ The question, then, is how much of the data to include in $N(t>t').$ To test this, I considered a half year's worth of data as a test set. Then, I figured out how much of the previous data I should use as my training set to most accurately capture the delays in the test set. I found that using a year of data prior to the prediction time worked the best. The method is not perfect; the percentage of missed trains predicted using the training set is about 10% off from the number in the test set, as there are year-to-year variations in the data.

A plot of $p(\tau)$ using all the available data and only using the last year of data is shown in Fig. 2. Using only the last year to build the model, to have an 80% chance of making the train, one can show up about 20 minutes after the scheduled time. This also confirms Amtrak's estimate that their trains are on time 20-30% of the time. Even if one shows up an hour after the scheduled time, though, he or she still has a 36% chance of making the train!

Fig. 2: $p(\tau)$ determined using all of the available data (blue) and only the last year of data (gold). I see that for delays longer than 60 minutes, the two curves are similar, indicating that for long waits either prediction method would give similar results. It appears that in the last year the shorter delays have been worse than the long-term average, as there are significant discrepancies in the curves for shorter delays.

A Bayesian approach

With a Bayesian approach, I would like to write down the probability of a delay, $\delta,$ given the data of past arrivals, $\mathcal{D}.$ I will call this $p(\delta|\mathcal{D}).$ Suppose I have a model, characterized by a set of $n$ unknown parameters $\vec{a}$ that describes the probability of delay. I will assume all the important information that can be extracted from $\mathcal{D}$ is contained in $\vec{a}.$ Then, I can decompose the delay distribution as

Using Bayes theorem, $p(\vec{a}|\mathcal{D})$ can then be expressed as

Here, $p(\mathcal{D}|\vec{a})$ is the likelihood function (the model evaluated at all of the data points), $\pi(\vec{a})$ is the prior on the model parameters, and $p(\mathcal{D})$ is the evidence that serves as a normalization factor.I use non-informative priors for $\pi(\vec{a}).$

The question is, then, what the model should be. A priori, I have no reason to suspect any model over another, so I decided to try many and see which one described the data best. To do this, I used the Bayes factor, much like I used previously, with the different models representing different hypotheses. The evidence for a model $\mathcal{M}_1$ is $f$ times greater than the evidence for a model $\mathcal{M}_2$ where

As the models are assumed to depend on parameters $\vec{a}$ (note that a method that does not explicitly have a functional form, such as a machine learning method, could still be used if $p(\mathcal{D}|\mathcal{M})$ could be estimated another way)

Here, $\delta_i$ are all of the delays contained in $\mathcal{D}.$ This integral becomes difficult for large $n$ (even $n=3$ is getting annoying). To make it more tractable, let $l(\vec{a}) = \ln(p(\mathcal{D}|\vec{a})),$ and let $\vec{a}^*$ be the value of the fit parameters that maximize $l(\vec{a}).$ Expanding as a Taylor series gives

where $H$ is the matrix of second derivatives of $l(\vec{a})$ evaluated at $\vec{a}^*.$ The integral can be evaluated using the Laplace approximation, giving

which can now be evaluated by finding a $\vec{a}^*$ that maximizes $p(\mathcal{D}|\vec{a}).$ (Regular priors much be chosen for $\pi(\vec{a}^*|\mathcal{M})$ since I have to evaluate the prior. I will ignore this point here). I tested the exponential, Gompertz, and Gamma/Gompertz distributions, and found under this procedure that the Gamma/Gompertz function described the data the best under this metric. Using this, I explicitly calculate $p(\delta|\mathcal{D}),$ again under the Laplace approximation. This gives the curve shown in Fig. 3, which, as expected, looks quite similar to Fig. 2.

While this section got a bit technical, it confirms the results of the earlier simple analysis. In particular, this predicts that one should show up about 17 minutes after the scheduled arrival time to ensure that he or she will catch 80% of trains, and that one still has 30% chance of catching the train if one shows up an hour late to the Durham station.

Fig. 3: $p(\delta|\mathcal{D})$ calculated using only the last year of data. Note that this curve is quite similar to Fig. 2.

Conclusions

Since I downloaded all the data, 5 days have passed and in that time, the delay of the 79 train has been 22, 40, 51, 45, and 97 minutes. It's a small sample size, but it seems like the prediction that 80% of the time one would be fine showing up 20 minutes late to the Durham station isn't such a bad guideline.

Of course, both of these models described are still incomplete. Especially with the frequentist approach, I have not been careful about statistical uncertainties, and both methods are plagued by systematic uncertainties. One such systematic uncertainties is that all days are not equivalent. Certain days will be more likely to have a delay than other. For example, I am sure the Sunday after Thanksgiving almost always has delays. No such patterns are taken into account in the model, and for a true model of delays these should either be included or the effect of such systematic fluctuations could be characterized.

Scripts for this post are here.

# Hypothesis Testing in a Jury

I recently served on a jury and was quite surprised as how unobjective some of the other jurors were being when thinking about the case. For our case, it turned out not to matter because the decision was obvious, but it got me thinking about a formal reasoning behind "beyond a reasonable doubt." This reasoning will involve more statistics than physics, but considering I've been thinking about Bayesian analyses recently in my research, it's quite appropriate.

At the most basic level, a jury decision is a hypothesis test. I wish to distinguish between the hypotheses of not guilty (call it $\mathcal{H}_0,$ since the defendant is innocent until proven guilty) and guilty (call it $\mathcal{H}_1$). In Bayesian statistics, the way to compare two hypotheses is by computing the ratio of posterior probabilities.

Where $p(\mathcal{H}|D)$ is the probability of the assumption $\mathcal{H}$ given the available data (the evidence). This probably does not seem obvious to compute, and I'll discuss later how one might determine values for these. If $F=2,$ then the hypothesis $\mathcal{H}_1$ is twice as likely as the hypothesis $\mathcal{H}_0$. Thus, if $F \gg 1,$ then the evidence for $\mathcal{H}_1$ is overwhelming. What this means in terms of "beyond a reasonable doubt" is debatable, but it is generally accepted that if $F \gtrsim 100,$ there is strong evidence for $\mathcal{H}_1$ over $\mathcal{H}_0$ [1]. Similarly, if $F\ll 1,$ then the evidence for $\mathcal{H}_0$ is overwhelming. Thus, the question or determining guilt or not guilt is equivalent to calculating $F.$

$p(\mathcal{H}|D)$ can be rewritten using Bayes theorem as

Here, $p(D|\mathcal{H})$ is the probability of the evidence given the assumption of guiltiness (or not-guiltiness), which is more tractable than $p(\mathcal{H}|D)$ itself. Note the prosecutor's fallacy can be thought of as confusing $p(D|\mathcal{H})$ with $p(\mathcal{H}|D).$ $p(\mathcal{H})$ is the prior, which takes into account how much one believes $\mathcal{H}$ with regard to other hypotheses. $p(D)$ is a normalization factor to ensure probabilities are always less than 1. In the relation for $F,$ this cancels out, so there is no need to worry about this term. With this replacement, the ratio of posterior probabilities becomes

The first ratio is called the Bayes factor. The second ratio quantifies the ratio of prior beliefs of the hypotheses. The ratio is, given no other information, the odds that the defendant is guilty. Suppose I accept that there was a crime committed, but that the identity of the criminal is in question. If there is only one person that committed the crime, this would then be the inverse of the number of people who could have committed the crime.

Now, I will consider the calculation of the Bayes factor for a real trial. Consider R v. Adams, which set a precedent for banning explicit Bayesian reasoning in British Courts in the context of DNA evidence. It was estimated during the trial that there were roughly 200,000 men in the age range 20-60 who could have committed a crime. Note that some extra assumptions on age and gender seem to be made here, so this does not seem applicable to the ratio of prior beliefs. However, if I lifted these restrictions, the Bayes factor for the victim's misidentification of the defendant would change accordingly, so this is not a concern.

First, consider the DNA evidence that was the only piece of evidence incriminating the defendant. Call this evidence $D_0.$ $p(D_0|\mathcal{H}_1)$ is the probability of a positive DNA match under the assumption that the defendant is guilty. This is presumably extremely close to 1, or DNA evidence would not be considered good evidence in trials. $p(D_0|\mathcal{H}_0)$ is the probability of a positive match if the defendant is not guilty. Taking into account the population of the U.K., this was estimated in the trial to be between 1 in 2 million and 1 in 200 million (though possibly as low as 1 in 200 since the defendant had a half-brother) [2]. Thus, the Bayes factor considering only the DNA evidence, is between 2 million and 200 million. With the 1 in 200,000 prior probability, the posterior probability ratio $F$ is between 10 and 1000. Only the higher end of this range is overwhelming evidence, and in the case of conflicting evidence, the jury is supposed to give the benefit of the doubt to the defendant, so it seems a "not guilty" verdict would have been appropriate.

Further, this ignores all of the other evidence that helped to prove the defendant's innocence. This included the victim failing to identify the defendant as the attacker and the defendant having an alibi for the night in question. Let us call these two pieces of evidence $D_1$ and $D_2.$ Unlike the DNA evidence, the witnesses do not explicitly mention what the relevant probabilities are for this evidence, so it is up to the jurors to make reasonable estimates for these quantities. $p(D_1|\mathcal{H}_1)$ is the probability the victim fails to identify the defendant as the attacker given the defendant's guilt. Set this to be around 10%, though police departments may actually have statistics on this rate. On the other hand, $p(D_1|\mathcal{H}_0),$ the probability the victim fails to identify the defendant as the attacker given the defendant is not guilty is high, say around 90%. Thus, the Bayes factor considering the victim's failed identification of the defendant is about 1 in 10. Note that even if these numbers change by 10% this factor doesn't change in order of magnitude, so as long as a reasonable estimate is made for this factor, it doesn't really matter what the actual value is. The alibi is less convincing. Though the defendant's girlfriend testified, the defendant and the girlfriend could have confirmed their story with each other. Thus, I estimate the Bayes factor for the alibi $p(D_2|\mathcal{H}_1)/p(D_2|\mathcal{H}_0)$ to be about 1 in 2.

Since all these pieces of evidence are independent, $p(D_0,D_1,D_2|\mathcal{H})=p(D_0|\mathcal{H})p(D_1|\mathcal{H})p(D_2|\mathcal{H}).$ Thus, the Bayes factor for all evidence is between 100,000 and 10,000,000. Now, multiplying by the prior probability, this gives a posterior probability ratio, $F,$ between 0.5 and 50. With the new evidence taken into account, there is no longer strong evidence that the defendant is guilty even in the best case scenario for the prosecution. Convicting someone with a posterior probability ratio of 50 would falsely convict people 2% of the time, which seems like an unacceptable rate if one is taking the notion of innocent until proven guilty seriously. Note that as long as the order of magnitude of each of the Bayes factor estimates doesn't change, the final result will also not change by more than 1 order of magnitude, so the outcome is fairly robust.

While this line of logic was presented to the jurors during the trial, the jurors still found the defendant guilty. The judge took objection to coming out with a definite number for the odds of guilt when the assumptions going into it are uncertain, though as argued before, for any reasonable choice, these numbers cannot change too much [2]. It seems that without formal training in statistics, it was difficult to accept these rules as "objective," even though this is a provably, well-defined, mathematical way to arrive at a decision. If common sense and the rules of logic and probability are really what jurors are considering to reach their decision, this has to be the outcome that they reach. [3] argues that to not believe the outcome of a Bayesian argument like this would be akin to not believing the result of a long division calculation done on a calculator.

The most common objection to Bayesian reasoning (though apparently not the one the judges in R v. Adams had) is that the choice of prior can be somewhat arbitrary. In the example above, in the estimation of people that could have committed the crime, I could take people who were on the same block, the same neighborhood, the same city, or maybe even the same state. Each of these would certainly give different answers, so care must be taken to choose appropriate values for the prior. This doesn't make the method wrong or unobjective. It is just that the method cannot start with no absolutely no assumptions. Given a basic assumption, though, it provides a systematic way to see how the basic assumption changes when all the evidence is considered.

This problem seems to stem from a misunderstanding of basic statistics by the jury, attorneys, and judges. Another example of this is the U.K. Judge Edwards-Stuart who claimed that putting a probability on an event that has already happened is "pseudo-mathematics" [4]. This just shows the judge's ignorance, as this is precisely the type of problem Bayesian inference can explain. It is a shame that in the U.K. Bayesian reasoning is actively discouraged due to R v. Adams, as this is the only rigorous way to deal with these types of problems. I wasn't able to find any specific examples in the U.S., but I assume the "fear" of Bayesian statistics in courts here is similar to the case in the U.K.

References
1. Jeffreys, H. 1998. The Theory of Probability.
2. Lynch, M. and McNally, M., 2003. “Science,” “common sense,” and DNA evidence: a legal controversy about the public understanding of science. Public Understanding of Science, 12-1, 83-103.
3. Fenton, N. and Neil, M., 2011. Avoiding Probabilistic Reasoning Fallacies in Legal Practice using Bayesian Networks. Austl. J. Leg. Phil., 36, 114-151.
4. Nulty and Others v. Milton Keynes Borough Council, 2013. EWCA Civ 15.

# Resistors and Distance on Graphs

I feel bad for not having written a post in so long! I have been busy with teaching, research, and various other projects. Now that my teaching duties are done, I will try to post more regularly!

A graph is a collection of nodes that are connected by edges that are drawn under a defined condition. These edges may or may not have weights. Graphs are useful representations of data in many scenarios. For example, the Internet is an example of a graph. Each web page would be a node and an edge would be drawn between two nodes if one of the pages links to the other. These edges could be unweighted or could represent the number of links between the two pages. Another example is social media. For example, the nodes could be all users of the service and an edge would be drawn if two nodes are "friends" with each other.

One way to define a distance between two nodes on the graph may be to find the shortest path between the nodes. This would be defined as the collection of edges from one node to the other such that the sum of the weights of the edges (or simply the number of edges in the case of an unweighted graph) is minimized. This is what LinkedIn seems to do when they compute the "degree" of connection of a stranger to you. This is also how Bacon and Erdös numbers are defined.

Fig. 1: Two graphs with the same shortest path distance between A(lice) and B(ob). However, it is clear that in the right graph A(lice) seems better connected to B(ob) than in the left graph.

A shortcoming of this measure, though, is that shortest path ignores how many paths there are from one node to the other. This scenario is depicted in Fig. 1. Suppose in a social network, we would like to determine something about how likely it is that one person (say Alice) will meet another (say Bob) in the future. Suppose Bob is a friend of only one of Alice's friends. Then, given no other information, it seems unlikely that Alice would ever meet Bob, since there is only one avenue for Alice to meet Bob. Of course, this could be different if Bob was good friends with Alice's significant other, so we might want to weight edges if we have information about how close Alice is to her friends in the social network. Now, if Bob is friends with half of Alice's friends, it seems quite likely that when Alice goes to a party with her friends or is hanging out with one of those friends, then Alice will run into Bob. In both of these cases, the shortest path distance between Alice and Bob is the same, but the outcome of how likely they are to meet (which is what we want to analyze) is different.

It turns out that a useful analogy can be made by considering each edge as a resistor in a resistor network. In the unweighted case, each edge is a resistor with resistance 1 (or $1~\Omega$ if having a resistance without units bothers you, though units will be dropped for the rest of the post) and in the weighted case, each edge is a resistor with resistance equal to the inverse weight of the edge. We will see that the effective resistance between two nodes is a good measure of distance that more accurately captures the scenario described above. Groups of nodes with small effective resistance between them will correspond to clusters of people (such as people who work in one workplace) in the social network.

The effective resistance satisfies the triangle inequality, which is a defining property of distances [1]. We can see this as follows. The effective resistance between nodes $a$ and $b$ is the voltage between the two nodes when one unit of current is sent into $a$ and extracted from $b.$ Let the voltage at $b$ be zero (we are always free to set zero potential wherever we like). Then $R_{ab} = v_a = (v_a-v_c)+v_c.$ Now, we know that $v_a-v_c \leq R_{ac},$ since the potential should be maximal at the source (current doesn't climb uphill). Similarly, $v_c\leq R_{cb}$ (remember node $b$ is grounded). Thus, we have that $R_{ab} \leq R_{ac}+R_{cb},$ and the triangle inequality holds. The other defining characteristics hold quite trivially, so the effective resistance is a reasonable way to calculate distance.

Now consider the example presented before and depicted in Fig. 1. If Bob is only friends with one of Alice's friends, and there are no other links between Alice and Bob, then the effective resistance between Alice and Bob is 2. In this case, the effective resistance is the same as the shortest path distance. If Bob is friends with 7 of Alice's friends, and there are no other links between Alice and Bob, the effective resistance between Alice and Bob is 0.29. So, we see that this satisfies the property that having one connection is not as important as having many connections to Bob.

Another interesting consequence of this analogy is related to random walks. Suppose a walker starts at a node $v$, and chooses a random edge connected to that node to walk along (the random choice is weighted by edge weights if the graph is weighted). Then, the expected number of steps to start at $v,$ go to another node $w,$ and return back to $v$ (which is called the commute time) is proportional to the effective resistance between $v$ and $w.$ One way to think about this is that a way to estimate effective resistance, then, on a large resistor network would be through the use of random walks [2]. This is interesting for tricky resistor problems such as the infinite grid of resistors. This also reinforces the idea that effective resistance is a good way of quantifying communities. When one node is well connected with another, it should be relatively easy to commute between the nodes, and thus they should be part of the same community. Further, this notion can be used to place bounds on the maximum possible commute time.

Effective resistance has many other uses as well. Quickly computing effective resistance turns out to be quite useful in speeding up an algorithm that solves the maximum flow problem [3]. In addition, sampling edges by weights proportional to the effective resistance across the edge yields an algorithm that sparsifies graphs. This means that the new graph has fewer nodes and edges than the original graph, but looks "close" to the original graph, and is a more efficient representation of that information [4].

References
1. Lau, L.C., 2015. Lecture notes on electrical networks.
2. Ellens, W., 2011. Effective Resistance.
3. Christiano, P., et al., 2010. Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs.
4. Spielman, D. and Srivastava, N., 2011. Graph Sparsification by Effective Resistances. SIAM J. Comput., 40(6), 1913–1926.

# Ferguson and Game Theory

Note: In this post I use words like "player," "game," and "payout" to refer to the interactions between criminals and police officers. I do not mean to trivialize deadly encounters between criminals and police officers and I am not implying that police officers are playing a game at their jobs. These are the terms that are used in discussions of game theory and to stay consistent with this, I use these terms.

In this blog, I usually talk about how ideas of physics apply to studying problems "outside physics". As far as I know, game theory isn't too useful in physics, but physicists have contributed to the theory of it as in [1]. Game theory is an interesting way to mathematically model interactions of people.

In a game theory game such as the prisoner's dilemma, each player has a choice of cooperating or defecting with the opponent. If the player cooperates while the other cooperates, the two players both get a mid sized reward. If one player cooperates while the other defects, that player gets nothing while the opponent wins a large reward. In the player defecting, opponent defecting case, the converse occurs. If both players defect, they both get a small reward. While seemingly simple, there is a rich theory to games like these. In particular, the iterated prisoner's dilemma (where this game is repeated many times) is used to study population dynamics.

In a game, the rewards (or payouts as they will be referred to later) are characterized by a payout matrix. The payout matrix shows, for each player, what his or her payout will be given each of the possible outcomes.

Fig. 1: Examples of payout matrices for each player in the prisoner's dilemma. If both players cooperate, they both get a payout of 3. If both players defect, they both get a payout of 1. If one player defects while the other cooperates, the defector gets 5 while the cooperator gets 0. Note that for the prisoner's dilemma, the game looks the same for each player. In the game I consider below, this is not the case.

We will apply this type of game theory logic to whether a police officer should shoot a criminal or not. This game is more complicated than the classical prisoner's dilemma as the payoff is not the same to either player (police officer or criminal), even if we assume each player values their life the same. Further, the criminal's payoff matrix is certain, while the police officer's payoff matrix has some uncertainty.

The scenario I am imagining is the following, inspired by the Ferguson incident. A criminal has committed a crime and is being chased down by a (or a pair of) police officer(s). I will take cooperating to mean that one player does not attack the other player, and defecting to mean that one player does attack the other player. I assume that if a player has a gun and defects, the other player dies, though most of the following argument would also follow if instead of dying the player is severely injured. For payouts, I assume a hierarchy where the payout for nothing happening >> the payout for arrest, suspension >> the payout for death (severe injury). I will take specific values of 0,-100,-10000 for these payouts mostly to make the arguments easier to understand, but the general results should hold regardless of the specific values chosen, as long as this hierarchy is in place. I assume that the game is played only once between the criminal and police officer(s), but it would be possible to think of this as an iterated game, where both players have to assess at each stage how much danger they are in and determine to defect or cooperate.

The payout matrix for the law enforcement officer is uncertain. The payout looks like the following, where the top consequences are the payouts when the criminal has a gun (or a lethal weapon) and the bottom are consequences when the criminal does not.

Fig 2: Payout matrix for the law enforcement officer. The top consequences are the consequences when the criminal has a lethal weapon, whereas the bottom consequences are when the criminal does not.

I would like to assign numerical values to these scenarios. In order to average the two scenarios (criminal having a lethal weapon and not) I will assume a prior distribution. This would be the fraction of criminals that are arrested by police officers that are carrying lethal weapons. This prior distribution presumably depends on the type of the crime the criminal committed, but I will not take this factor into account in the simple model I am considering. I was not able to find information about this distribution easily (though I hope this information is available to law enforcement officers) so I assume that 50% of criminals are carrying lethal weapons. However, I expect that this is an overestimate.

Fig 3: Numerical values for the payout matrix for the law enforcement officer. The values are obtained by averaging the payouts of the each of the consequences weighted by the probability that the criminal has a lethal weapon.

Thus, if the law enforcement officer knows that the criminal is cooperating, the optimal thing for him or her to do is to cooperate as well, as this is what maximizes the payout. Unfortunately, there is a chance the criminal will defect. In this case, the police officer should still cooperate, as there is a chance the criminal does not have a lethal weapon. Then, the officer has no consequences, which is better than facing suspension. Note, however, that if the criminal is defecting, the difference in payout for the two strategies of the police officer is quite small (0.1% with the numbers I have chosen). The two strategies are almost equivalent to the police officer.

Now consider the criminal. The payout matrix for the criminal is certain, as it is common knowledge that law enforcement officers carry weapons. It would look like the matrices below.

Fig 4: Descriptive and numerical payout matrices for the criminal. We take the criminal defecting, police officer cooperating case to be -1000 as it is part way between -100 and -100000. This may be an underestimate as if a criminal manages to shoot a police officer, there is a chance that another police officer will shoot the criminal.

The interesting thing here is that if the police officer is defecting (which, recall is not such a bad strategy for the police officer if the police officer assumes the criminal will defect), there is no optimal strategy for the criminal. The criminal dies in both cases. Clearly then, as there is no strategy if the police officer defects, the criminal can only optimize the case when the police officer cooperates, meaning the criminal should cooperate. Thus, the Nash equilibrium is for the criminal and police officer to cooperate. This is presumably what happens most of the time. Criminals are arrested every day but we (comparatively) rarely hear about incidents where the criminal or police officer is killed. So does this mean game theory cannot explain what happened at Ferguson? Not quite.

Let's say the strategy of the police officer is to defect a fraction $x$ of the time and to cooperate $1-x$ of the time. Then the expected payout (probability multiplied by the payout of that case) for the criminal is $-100(1-x)-100000x$ if the criminal cooperates and $-1000(1-x)-100000x$ if the criminal defects. As an example, take $x$ to be $0.01$. At this value, the expected payout for the criminal is $-1099$ if he or she cooperates and $-1990$ if he or she defects. Clearly, the defecting strategy is still worse, but the difference between cooperating and defecting is only about 50%. The difference would be even less if the fraction $x$ were higher or if the payout of death were even more negative. Thus, if there's any reason for the criminal to think that the police officer may defect (and given news reports of incidents like Ferguson, there is probably reason for the criminal to expect this), the difference between cooperating and defecting is not so different in terms of payout. This means the actions of the criminal become unpredictable, which means the police officer may need to defect to preserve his or her life.

The main cause of this issue is that dying is really bad. Thus, if there's even a small chance that the police officer will defect, the criminal's choices become mostly equivalent. Now suppose police officers are only able to use non-lethal means against the criminal. Then the payout matrix for the criminal looks quite different.

Fig 5: Descriptive and numerical payout matrices for the criminal when the police officer cannot use lethal means against him or her. I choose a value of -200 for getting hurt, though it may not necessarily be this bad. If the criminal can prove that the police officer attacked him or her without just cause, the criminal may even receive some sympathy.

Now, if the criminal defects, the payout is -1000 no matter what, while if the criminal cooperates the expected payout can be anywhere between -100 and -200, depending on the police officer's strategy. The expected payout is no longer (approximately) degenerate as above. The clear choice here is to cooperate, as defecting is many times worse. If the criminals and police officers are acting rationally, then an assurance by law enforcement to not kill criminals seems to improve the rate that the criminals should cooperate. Of course, the problem with this is, that while the criminal's payoff matrix has changed by ensuring the police officer will not kill the criminal, the police officer's payoff matrix remains the same.

As mentioned earlier, unless the probability the criminal has a lethal weapon is extremely small, if the criminal has a chance to defect, the expected payout for the police officer is also nearly degenerate no matter the police officer's strategy. The police officer's situation could be improved though, if the chance of death of the police officer were smaller. This could include protective equipment and training to not get fatally wounded in these encounters. In addition, there is a crucial difference between the criminal and police officer in that the police officer has more experience with these encounters than the criminal does (though the police officer is disadvantaged in that he or she has less information about the opponent). Thus, it is imperative that the police officer is trained to make the right decisions in these tough situations. As stated before, this is presumably what happens most of the time when criminals get arrested, and thus the police officers are usually doing a good job. If the payoff to the police officer were truly degenerate, we would expect the police officers to defect about half of the time. However, there are incidents where the police officer unnecessarily perceives danger. The police officers should be aware aware of their risks so that tragedies do not occur.

The payoff matrix for the police officer also looks quite different based on the prior distribution (or expectation) of the criminal's likelihood to be carrying a lethal weapon. I assume I have overestimated the likelihood a criminal carries a lethal weapon, but with this information the police officer's "strategy" could be improved, and as I stated above, this could prevent people from unnecessarily dying.

Most of the analysis above has depended on the players of the game acting at least approximately rationally. It is an interesting question, though, how rationally we can expect the players in this game are playing. The criminal presumably is "on edge" from just having committed a crime and being chased down by police officers would probably not help the situation. In a study done with the game show Golden Balls, contestants were found to be more altruistic, in general, than a simple game theory analysis of rational players would predict [2]. This shows that even in scenarios where life or death is not on the line, people may not be expected to act rationally. With information about how criminals are likely to act, police officers can develop better strategies on how to deal with situations such as this one.

References
1. Press, W.H., 2012. Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl. Acad. Sci. 109-26, 10409–10413.
2. Van den Assem, M.J., 2012. Split or Steal? Cooperative Behavior When the Stakes Are Large. Management Science. 58-1, 2-20.

# Pedestrian dynamics

I've written about modeling the movement of cars as a fluid in the past. We could think about pedestrians like this, but usually pedestrians aren't described well by a fluid model. This is because while cars are mostly constrained to move in one direction (in lanes), this is not true of pedestrians. On a sidewalk, people can be walking in opposite directions, and often someone walking in one direction will directly interact (by getting close to) someone walking in the opposite direction. There are some specific scenarios where a fluid model could work, such as a crowd leaving a stadium after the conclusion of a basketball game. In this case, everyone is trying to get away from the stadium so there is some kind of flow. However, this doesn't work generally, so I will consider a different type of model, similar to the one described in [1].

If there are only two directions that people want to travel and they happen to be opposite, then we could model the pedestrians as charged particles. The pedestrians that want to go in opposite directions would be oppositely charged, and the force that keeps the pedestrians on a trajectory could look like an electric field. However, this would mean that people moving in opposite directions would attract each other, which really does not match expectations. This model also fails if there are multiple directions where pedestrians want to go (such as at an intersection), or if the desired directions of the pedestrians are not opposite. While a plasma (which is a collection of charged particles) model may not be the best to describe the scenario, I will borrow some ideas from plasma dynamics in building my model and I will use techniques used to simulate plasmas to simulate pedestrian movement.

There will be a few effects governing the movement of pedestrians. One effect is for the pedestrians to want to go a desired direction at a desired speed. It turns out that most humans walk at a speed of around $v_d=$1.4 m/s (3.1 mph) and if someone is going slower or faster than this, they will tend to go toward this speed. Let me call the desired speed along the desired direction of the pedestrian $\vec{v}_d,$ and the current walking speed of the pedestrian $\vec{v}.$ I will model the approach to the desired direction as a restorative force that looks like

Here, $\tau$ represents how long it takes the pedestrian to get back to their desired position and direction once they are off track. In general, $\tau$ could be different for every pedestrian, but for simplicity I set it as a constant for all pedestrians here, and I will take it to 0.3 s, which is close to the human reaction time. Note that the restorative force is zero when $\vec{v_d}=\vec{v},$ so if a pedestrian is already going in their desired direction at their desired speed, there will be no restorative force and the pedestrian will continue to go at this direction and speed. You may find it odd that my force has units of acceleration. I am thinking about this more as a generalized sense of the term force as in something that causes velocity changes, but it would also be reasonable to assume that I have set the mass of the pedestrians to 1.

Pedestrians will also avoid colliding with each other, which is the other force I include in the model. While [1] assumes an exponential force for the interaction force, I will assume that pedestrians interact via a generalized Coulomb potential. The general results seem to match without too much regard for the exact shape of the force. I define the force between pedestrian $i$ and pedestrian $j$ is

Where $\vec{r}_{ij} = \vec{r}_i - \vec{r}_j$. $\gamma,$ $\epsilon,$ $\alpha,$ and $r_0$ are constants that I will describe below. $r_0$ is an interaction radius that sets the scale for this interaction. This would not necessarily be the same for everyone. For example if someone is texting, their interaction radius $r_0$ is probably much smaller than someone who is paying attention to where they are going. However, for simplicity I take it to be the same for everyone, and I take it to have a value of 1.2 m.

Since pedestrians travel in 2 dimensions, if $\alpha = 1$ and $\epsilon = 0,$ this would be the Coulomb potential, if $\gamma$ were aptly chosen. In this scenario, however, I do not really want the Coulomb potential. The Coulomb potential is quite long range, meaning that particles in a Coulomb potential can influence particles that are quite far away. As the power $\alpha$ in the equation above gets larger, the force becomes more short-range, which seems to better model the interactions of pedestrians. However, this presents another problem in that the force gets extremely large if two pedestrians happen to get really close to one another. To combat this, $\epsilon$ is a small number that "softens" the force such that the force never gets extremely large (which I took to mean $|\vec{F}_{ij}|$ should never be too much bigger than the maximum possible value of $|\vec{F}_{restore}|$). $\gamma$ then decides the relative importance of this interaction force to the restorative force.

I will simulate this model by considering $N$ people are in a long hallway with aspect ratio 1:10, for example at an airport or a train station. This can also be a model for a long, wide sidewalk as even though there are no walls, people are relatively constrained to stay on the sidewalk. I have some people trying to get to one end of the hallway (in the $+\hat{x}$ direction) and some trying to get to the other end (in the $-\hat{x}$ direction). This is an example of an N-body simulation, which is widely used in studying gravitational systems and plasma systems.

In [1], the walls exerted an exponential force on the pedestrians. I choose a similar model. I set the parameters of the exponential empirically such that the pedestrians keep a reasonable distance from the walls. I set the range of the exponential force to be a tenth of the total width of the corridor. I set the magnitude such that at the maximum, the force due to the wall is the same as the maximum value of $|\vec{F}_{restore}|.$

When a pedestrian reaches the end of the corridor, I record how much time it took for that pedestrian to traverse the corridor. I then regenerate the pedestrian at the other end of the corridor as a new pedestrian. I generate the new pedestrian with a random $y$ coordinate and a random velocity direction, but pointing at least a small bit in the desired direction. The magnitude of the velocity is taken to be $v_d.$ Thus, the simulation is set up such that there will always be $N$ people in the hallway.

A simulation of $N$=100 people in a hallway of dimensions 100 m x 10 m. All pedestrians desire to go to the left of the hallway. The pedestrians relax to a state where they are each about the same distance from each other. It seems that people usually stand closer together on average, so our value of $r_0$ should probably be smaller to match observations.

The first thing I tried was to simply put a few people in the hallway all wanting to go in the same direction, and see what they do. I set the length of the hallway to be 100 m, which made the width of the hallway 10 m. As can be seen above, this isn't too exciting. The pedestrians' paths are mostly unobstructed and they get across the hallway in about 71 s, which is the length of the hallway divided by 1.4 m/s, the comfortable walking speed of the pedestrians. Even in this simple case, though, it is apparent that the pedestrians "relax" into a scenario where the average distance between the pedestrians is roughly the same.

A simulation of $N$=100 people in a hallway of dimensions 50 m x 5 m. Pedestrians are equally likely to want to go left or right. We can see that lanes of people that would like to go in the same direction can form, as was observed in [2]. This effect could be even stronger with an extra "incentive force" for people to be on the right side of the road if they are not already on that side.

The time required to cross the room as a function of density of people in a simulation of $N$=100 people. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). $p=0.5,$ which means half of the pedestrians desire to go to the left and the other half desire to go to right. I change the density of people by changing the size of the room from 2.5 m x 25 m to 10m x 100 m. As the density is higher, the pedestrians interact more with each other and thus are less likely to be on their desired trajectory.

Next, I looked at the more interesting cases of what happens when there are pedestrians that want to go in different directions. First, I assume that exactly half of the pedestrians would like to go in one direction and half would like to go the other direction. I then varied the length and width of the hallway, keeping the aspect ratio constant, while keeping the number of people in the hallway constant at 100. This has the effect of changing the density of people in the hallway. The y-axis on the graph above is normalized by $L/v_d,$ which is the time a pedestrian with all of his or her velocity in the desired direction would take. This shows that as the density increases, it takes longer (proportionally) for the pedestrians to get across the room. This makes sense as the pedestrians are interacting more often and thus cannot keep going in the desired direction.

A simulation of $N$=100 people in a hallway of dimensions 50 m x 5 m. 90% of pedestrians want to go left, while the other 10% want to go right. The right-going pedestrians undergo many interactions with the left-going pedestrians. In fact, if the left-going pedestrians were denser, this could look like Brownian motion.

The time required to cross the room as a function of $p$, the fraction of $N$=100 people that would like to go left or right. The y-axis is normalized by the length of the room divided by the desired velocity (1.4 m/s). The size of the room is 5 m x 50 m. The blue line is the time to get across for the pedestrians going leftward, and the red line in the time to get across for the pedestrians going rightward. As the fraction of pedestrians going leftward increases, it becomes easier for those pedestrians to get across, but it makes it harder for the pedestrians that would like to go in the opposite direction to get across more slowly.

I then took the number of people in the hallway to be 100 with the length of the hallway being 50 m and the width being 5 m. I observed what happened as I varied the fraction of pedestrians, $p$ that wanted to go in either direction. This effect is shown above. As $p$ is increased, the more dominant pedestrians can get through the corridor more quickly than the less dominant pedestrians. Again, this makes sense as when people go "against the gradient," they have to weave through people to try to get to the other side.

I will note that I have not done this simulation in the most efficient way. For every pedestrian, I calculate the interaction force with all the other pedestrians and add up all the contributions. It turns out one can average or sometimes even ignore the effect of pedestrians far away, which can make the code run about $1/N$ times faster.

The python and gnuplot scripts I used for the simulation and to create the plots are available here.

References:
1. Kwak, J., 2014. Modeling Pedestrian Switching Behavior for Attractions. Transportation Research Procedia. 2. 612-617.
2. Tao, X., 2011. A Macroscopic Approach to the Lane Formation Phenomenon in Pedestrian Counterflow. Chinese Phys. Lett. 28.

# Entropy and least squares

While most of the papers I reference in this blog are relatively new, this article will discuss an old idea, described in some detail in [1]. Entropy has been used as a concept in information theory for over half a century. One definition of the thermodynamic entropy is

Here $k_B$ is the Boltzmann constant and $p_i$ is the probability that the system is in the $i$th microstate. The sum is over all of the microstates, which is one particular configuration for the system. For example, if I were calculating the entropy of a gas molecule in a box, a microstate would be the gas molecule with a particular position and momentum.

Since this is an equation with probability in it, it seems natural to extend the idea to other fields that deal with probability. This is the idea behind the Shannon entropy of information theory and statistics. This entropy is defined in a similar way. If one has a discrete probability distribution, $p(x_i),$ then the Shannon entropy is

Here the sum is over all possible values of $x_i.$

The second law of thermodynamics states than an isolated system will always try to maximize its entropy. Thus, if I want to determine what the equilibrium configuration of a system is, I can do this by maximizing the entropy of the system. This is the approach [2] takes in "deriving" thermodynamics from basic statistical mechanics. The entropy is maximized subject to the constraint that the distribution should have a certain energy or particle number, and the Lagrange multipliers enforcing the constraints turn out to be related to temperature and chemical potential.

A similar idea can be applied to statistics. Suppose we would like for a probability distribution to have average $\mu$ and standard deviation $\sigma,$ but have as little other information encoded as possible. In other words, given a set of data points $x_i,$ I would like to find $p(x_i),$ the assignment of probability to each of the $x_i,$ such that the average $x_i$ value will be $\mu$ and the standard deviation is $\sigma^2$. I would do this by maximizing the Shannon entropy (by setting $dS = 0$) subject to some constraints. These constrains are

By setting $dS=0,$ the form of the equation for $p(x_i)$ becomes

Here, the $\lambda_j$ are the Lagrange multipliers. I then plug this form into the constraint equations to solve for the values of $\lambda_j.$ This gives

where $Z,$ the partition function, is

So I see the partition function is a useful object in the statistical sense as well. We can't simplify this any further without knowing specific values for $x_i,$ but given this information it would be easy to solve for the values of the Lagrange multipliers.

This procedure produces some reasonable results. For example, if the set of points is the entire real axis and I would like to apply the constraints above (though I have to do things a little differently since this is a continuous case), the distribution this procedure gives turns out to be a Gaussian. Thus, the Gaussian is a distribution over the whole real line that has a set average and standard deviation but encodes as little other information as possible.

There is a notion of "relative entropy" that may not be as familiar to physicists (at least I had never heard of it). This is called the Kullback-Leibler (KL) divergence. This can be quantified as (this is actually the negative of the KL divergence)

The KL divergence of a distribution $P$ from a distribution $Q$ quantifies how much information is lost when $Q$ is used to approximate $P.$ This seems like a nice thing to consider in the context of regression. I will follow [3] to use this to show how to compare two fit models and determine which one is more robust.

Let me assume there is some true distribution $f(y)$ and I am approximating it by a function $g(y|x).$ Now consider the expected entropy in $x$ (here $E_x$ will denote the expectation value with respect to $x$). This is

Now suppose there were another model, $g'(y|x).$ I would like to consider whether $g$ or $g'$ describes $f$ better. I can look at this by looking at the difference in the expected entropy of the two.

I have made a measurement of $\ln(g(y|x))-\ln(g'(y|x))$ by performing the fit, as this is a ratio of log-likelihoods. Asymptotically, (look at [3] for details) this measurement will differ from the expected value by $2(k-k'),$ where $k$ and $k'$ are the number of parameters used in the fit of $g$ and $g',$ respectively. Correcting for this bias, the difference in entropies is

where $L$ and $L'$ are the likelihood functions of the model $g$ and $g'.$ Thus, while the logic was a bit complicated, the difference in entropies shows us an amazingly simple result! All that I need to do to compare the quality of two models is to look at the difference of twice the log-likelihood and the number of fit parameters. This is the Akaike information criterion (AIC). The AIC is an extremely useful metric to decide whether a model is being over- or under-fitted!

References:
1. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics. Physical Review 106-4, 620-630.
2. Pathria, R.K., 1972. Statistical Mechanics.
3. Akaike, H, 1985. Prediction and Entropy. A Celebration of Statistics 1, 1-24.

# Ising model and voting patterns

The Ising model is a ubiquitous model in condensed matter physics as it is a simple model with nontrivial behavior. It is one of the few systems that has analytic solutions, though this only holds for the Ising model in low dimensions. Despite its simplicity, it does quite well at describing the ferromagnetic systems.

While the Ising model is typically applied to ferromagnetic spin systems, even within physics it can also be used to describe a gas that lives on a lattice. Here, I will show how the model can be extended outside of physics and describe how the Ising model relates to voting.

Assume there are $N$ people that will vote in an election. For simplicity, assume there are just two candidates in the election (or the vote is for a yes/no proposition). Let the state of voter $i$ be characterized by a variable $s_i$ that take the value of $+1$ or $-1,$ corresponding to voting for one of the two outcomes of the election. The assumption is that each voter will interact with other voters and convince them that they should vote a certain way. Let the energy associated with a voter agreeing with someone they interact with be $-\epsilon$ and the energy associated with a voter disagreeing with someone they interact with be $+\epsilon.$

For each person, there will be a propensity to vote a certain way, such as being affiliated with a political party or being subjected to advertising campaigns. Let the energy of this propensity be called $h_i$ for person number $i.$ Let $h_i$ be measured on a scale of $-\beta$ (always voting for candidate $-1$) to $+\beta$ (always voting for candidate $+1$). If a voter has a 0 value for this propensity, they are a "swing voter" and can be convinced by his or her neighbors to vote for either the $+1$ or $-1$ candidate. In a physical system, this propensity is equivalent to the existence of an external magnetic field in the spin system. Putting all this together, the total energy among the $N$ voters is

The angled brackets here mean only take the sum if $i$ and $j$ are able to interact. This is in here as clearly all $N$ voters will not talk to every one of the others unless $N$ is very small. This equation is precisely the equation for the Ising model. The equilibrium state will be one when this energy is minimized (though this minimum energy state is usually not unique). Now suppose all of the voters are on a 2D lattice (or grid), meaning that each voter interacts with 4 other voters. In the $\beta=0$ case (no political parties), then an analytic solution exists.

Configuration of voters on a 2D lattice with $\beta=0.$ Black are voters for candidate $+1$ and white are voters for candidate $-1.$ The left is the case where $\frac{\epsilon}{T} = 0.5,$ the right is where $\frac{\epsilon}{T} = 0.33.$ There is clearly clustering of similarly minded people on the left, but there is no such structure on the right. This is an example of a phase transition.

As shown above, in this case, depending on whether the "social temperature" is above or below a critical temperature, the voters will have random opinions or will form large regions of agreement. Note that as before, this temperature does not correspond to how hot it is where the voters live, but a measure of how nonideal voters are willing to be with regards to the model. If the temperature is high, random fluctuations dominate and so there is no correlation among voters. However, if the temperature is low, the voters tend to form regions where they agree with others. The critical temperature can depend on the issue at hand. I tend to agree with most of the people I associate with, so in my experience it seems that voting systems are usually in the low energy limit. However, there are some issues that are not as clear (such as GMO labeling) where the voting system can be in the high energy limit.

Now let me consider the effect of political parties. This case cannot be solved analytically, so the only way to solve the problem is numerically. The best way to do this is using the Metropolis algorithm. In this method, we first assign randomly to all voters which candidate they will vote for (give them $+1$ or $-1$). We then choose a voter at random and consider the energy change $\Delta E$ if we changed that voter's vote. If switching the vote will decrease the total energy of the system, then the voter's vote is changed. If not, the voter's vote is changed with probability $e^{-\Delta E/T},$ where $T$ is the temperature of the system. If $T=0,$ this second condition never occurs and thus a voter's opinion is changed only if it decreases the total energy. Thus, as mentioned earlier, a nonzero temperature has the effect of measuring how nonideal voters are willing to be which corresponds to increasing the energy. This probability is called a Boltzmann factor and is ubiquitous in statistical physics. After this is done, another voter is chosen randomly and the process is repeated many times. After a long time, this algorithm produces a state that is typical of an equilibrium configuration of the voters.

I assume my voters live on a 2D lattice. This means that each voter interacts with 4 other voters and no one else. One reason I choose this is that it is particularly easy to visualize. I chose the number of voters to be 62500. I repeated the Metropolis algorithm $10^7$ times, so that every voter will probably be considered by the algorithm many times. I took periodic boundary conditions such that voters on the edge interact with voters on the other edge. This insured that every voter interacted with 4 other voters.

Configuration of voters on a 2D lattice with $\frac{\epsilon}{T} = 0.5$ and $\frac{\beta}{T}=2.$ The $h_i$ values for each voter were assigned randomly between $-\beta$ and $+\beta.$ Black are voters for candidate $+1$ and white are voters for candidate $-1.$ Unlike the $\beta=0$ case, there is now less clustering of like-minded voters.

The first thing I tried is shown above. I randomly assigning values of $h_i$ between $-\beta$ and $+\beta$ to each of the voters. I still assign randomly the initial configuration of votes. This has the effect of adding more disorder to the voters, and can prevent clusters of similarly minded voters from forming, even at low temperatures.

Configuration of voters on a 2D lattice with $\frac{\beta}{T}=0.5,$ with $\beta$ distributed from negative to positive linearly from one end of the lattice to the other. Black are voters for candidate $+1$ and white are voters for candidate $-1.$ The left is the case where $\frac{\epsilon}{T} = 0.5,$ the right is where $\frac{\epsilon}{T} = 0.33.$ There is a tendency in both images for votes to cluster toward one side.

This, however, doesn't seem too realistic, as like-minded people usually associate with each other. Because of this, I tried adjusting $h_i$ such that it varies linearly from one edge of the lattice to the other. This has the effect of encouraging order, and predictably groups similarly minded people (when $\beta=0,$ the choice of grouping is random and is an example of spontaneous symmetry breaking).

When the choice of $h_i$ is such that positive and negative values are equally likely, there is no clear winner in the election (the difference in $+1$ and $-1$ votes is usually very close to 0). However, if there is an asymmetry between the number of people who can have negative $h_i$ values and positive $h_i$ values, this is usually a good indicator of who will win the election. This asymmetry could be due to something as simple as one campaign having better ads than the others and getting the swing voters.

One extension that could be made is to have different weights for certain interactions. Reading a distant friend's post on Facebook is probably less relevant to you than a discussion with a close family member. To account for this, the value of $\epsilon$ could be adjusted such that it takes on different values for different pairs of interactions. Another possible extension is to consider different interaction schemes. It seems unlikely that every voter would have only four people they interact with. We could do something more realistic such as having a range of a number of interactions each voter can have. This is a little more complicated to program, however, as a different data structure (a graph) has to be used.

You can see the code I used here.

References:
1. Schroeder, Daniel. 2000. An Introduction to Thermal Physics.

# City Population and the Fokker-Planck equation

Zipf's law states that many types of data, when sorted, look like a Pareto distribution and when plotted on a log-log scale look like a straight line. In other words, many features tend to have power law behavior. This is simply described by the Pareto principle which states that often a small percentage of causes account for a large percentage of effects. This is exemplified by the 99% slogan used by the Occupy movement stating that the top 1% of people in the U.S. earned most of the income.

Zipf's law seems to be quite ubiquitous. Examples where similar trends have been observed are the number of crimes criminals commit, the amount of healthcare patients receive, the amount of complaints customers give, and many others [1]. This article is about the distribution of people across cities, which also follows Zipf's law.

[2] and [3] look at Zipf's law for wealth distribution. Here, I'll present the same basic argument they do, but in terms of population, as this seems a little more intuitive to me.

The above is a plot of the population of U.S. cities with more than 50,000 people by their rank in size. The data is from [4]. The orange line is the linear (on a log-log scale) fit. As you can see, the line fits the data quite well, especially toward the smaller cities. The slope of the line is about 0.725.

Now I will consider the mathematical model. Say there are $\mathcal{N}$ cities. The only assumption that I will make is that people can move between two cities following a simple rule. I will assume that two cities can move some (presumably small) constant $\beta$ times the minimum of the population of the two cities. I assume that the flow is random, such that this movement can occur either way.

Now assume two cities have population $n$ and $n'.$ The change, $\Delta,$ in population of the city with population $n$ after someone moves is

The change in population of the city with population $n'$ is $-\Delta.$ $r$ is a random variable that has a 50% chance of being +1 and a 50% chance of being -1. The PDF, $P(n,t)$ of a system undergoing a Markovian process of small steps like this is given by the Fokker-Planck equation.

where

is the Pareto function and

The angled brackets here represent expectation values. For a derivation of this relation, see [3]. In plasma physics, if the plasma is sufficiently dense, the constituents of the plasma will have many interactions with one another. If one particle in the plasma is looked at, the path looks like a simple random walk, which is a Markovian process. Thus, the PDF of the plasma will obey what looks like a Fokker-Planck equation. Similar arguments can be used to obtain a similar relation for the PDFs of particles undergoing Brownian motion. The Fokker-Planck equation has even found uses in mathematical finance.

One thing to note is that the value of $\beta$ cannot change the dynamics. I can change my units of time to absorb $\beta,$ and so nothing critical should depend on it (and I do not include $\beta$ when I do the numerical integration). The PDE is not easy to solve analytically as it mixes up derivatives and integrals of the PDF $P(n,t).$ Thus, I solve it numerically by assuming an initial distribution for $P(n,0)$ and letting it evolve in time until it seems like the PDF has reached a steady state. The initial distribution I assumed was exponential.

The population distribution of the model. Given the blue initial conditions, the system evolves to the purple line which looks linear over a large number of ranks. The straight orange line has slope 1.05. I believe that the kink that develops is a numerical effect and could be reconciled if the step size in space were smaller. However, because of the CFL condition this requires an even smaller time step and thus would make the code take an even longer amount of time.

Here, it is apparent that the PDF approaches a straight line in the log-log plot, as was apparent in the U.S. census data. It turns out that the choice of the initial condition does not matter much. In [2] the authors choose a much different initial condition but obtain a similar steady state solution after a long time.

One key difference between the result of this model and the U.S. population is that the slope of the lines are quite different. In the plot above, the slope of the line is about 1.05, while the U.S. population has a slope of about 0.725.

One way to modify the model to potentially reconcile this is to allow for moving to and from a city to have different probabilities. Intuitively, I'd expect more people would want to go from a smaller city to a bigger city rather than the other way around. It turns out this effect is pretty easy to incorporate. In the relation above for $\Delta,$ I change $r$ such that it is a random variable that has an $a$ chance of being +1 and a $1-a$ chance of being -1. Before, I took $a=1/2.$

In this case, $<\Delta>$ is no longer zero but the other term is exactly the same. The expression becomes

where

Note that now $\beta$ is almost a relevant parameter as it can no longer be completely absorbed into the definition of time. The relevant parameter here that will control the dynamics will be $\frac{2a-1}{\beta}.$

The population distribution of the modified model with $\frac{2a-1}{\beta} = 0.2.$ Given the blue initial conditions, the system evolves to the purple line which looks linear over a large number of ranks. The straight orange line has slope 0.8.

I solved this numerically in the same manner as before. In the above plot, the slope of the line is about 0.8, which is much closer to the population data. All this required was for moving to a big city to be about 10% more desirable than moving away from the city (though the exact percentage depends on the relationship between $a$ and $\beta$). I'm sure by tuning the ratio $\frac{2a-1}{\beta}$ further I could get it even closer to the 0.725 slope observed in the U.S. population.

There are also other factors that could be considered, such as the birth and death of city residents as well as immigration. [2] describes how to take these into account, but for the sake of simplicity I will not consider it here. An interesting aspect is that including just one of these cannot change the behavior (other than the size of the population increasing or decreasing), though adding both can lead to different behavior.

As I was writing this, I was trying to think whether something like crime could be modeled as a similar way, but I couldn't think of a reasonable exchange (like the $\Delta$ expression) that could be written. It would be interesting to look into how Zipf's law is emergent in other systems as well.

The code is available here. The code runs pretty slowly, it has not been optimized. The easiest optimization would probably be to do the computations on a GPU using pyCUDA. I'll leave that as an exercise to the reader ;).

References:
1. Reed, William J.; et al. 2004. The Double Pareto-Lognormal Distribution – A New Parametric Model for Size Distributions. Communications in Statistics – Theory and Methods 33 (8), 1733–1753.
2. Boghosian, Bruce M. 2014. Kinetics of wealth and the Pareto law. Physical Review E 89, 042804.
3. Boghosian, Bruce M. 2014. Fokker-Planck Description of Wealth Dynamics and the Origin of Pareto’s Law. arXiv.
4. http://www.census.gov/popest/data/cities/totals/2011/

# Traffic as a Fluid Flow

Whenever there is a flow of objects such as pedestrians [1] or cars [2], they can be described on a macroscopic scale (on average) as a flow of a fluid. Fluid flow is described by the Navier-Stokes equations. The equation defines a velocity field $\vec{u}(\vec{x},t),$ which satisfies

Here, $\rho$ is the density of the fluid, $P$ is the pressure and $\vec{f}$ is any force per unit volume acting on the fluid. As a side note, it turns out mathematicians study these equations a good bit as well (perhaps more than physicists). The existence and smoothness of the solutions to these equations are unknown in certain cases. If someone can prove this rigorously, he or she can claim a million dollar prize.

This equation is paired with the continuity equation, which is a statement that fluid cannot pile up. All fluid that comes in somewhere must leave. Mathematically, this is:

With both of these, the velocity field of the fluid can be solved for.

Let us consider the case of cars moving across the Bay Bridge. This is a straight path of about 4.5 miles where cars only enter or exit on the ends (assuming no cars, or very few cars, get off on Yerba Buena Island). Traffic can really only flow in one direction, so I will take the one-dimensional forms of the equations above.

The density, $\rho$ is simply the number of cars in a certain volume of space. The forcing term tends to speed up cars toward an optimal speed, $\tilde{u}$. When there are few cars on the road this optimal speed will be pretty high, most likely a little over the speed limit. However, as the density of cars gets higher, the optimal speed will lower [2]. The pressure can be empirically determined, but as expected it tends to increase with larger densities. With all this taken into account, the traffic Navier-Stokes equations are:

where $\tau$ is a relaxation time, which sets the scale of how quickly the fluid can get up to the optimal speed.

I will consider steady state solutions which will be applicable whenever a certain stimulus has been present for a long time. Of course it can be interesting to see what how the system gets there. For example, if there were an accident that occurred, it could be interesting to see how long it takes for the other side of the bridge to feel the effect of the accident. To keep it simple, however, I will consider the case when the accident has occurred and been there a while but not cleaned up yet. This means that all of the time derivative terms will be zero. This is nice because the continuity equation now becomes (after dividing through by $\rho$)

Now I will put in the speed of sound, which is the derivative of the pressure with respect to $\rho.$ Plugging this into the Navier-Stokes equation gives

Now this is just a first order equation for $u$ without any dependence on $\rho.$ It can easily be solved numerically. For the example on the bridge, I set the target speed $\tilde{u}$ to be 65 mph. The relaxation time $\tau$ should be related to how quickly cars can respond to external stimuli, so I take it to be a few seconds, as this seems like a reasonable time scale for a human driving the car. The speed of sound I took to be a free parameter and adjusted it to get reasonable dynamics. I assumed that once a car leaves the toll booth, it takes about half a mile or half a minute to reach full speed. This happened to make $c$ about 100 mph. The speed as a function of distance along the bridge is pretty uninteresting in this case. The cars move up to maximum speed and stay there.

The total time on the bridge can be found from the definition that $u(x) = \frac{dx}{dt}.$ Integrating this relation gives

where $L$ is the length of the bridge. In this model it takes about 4.5 minutes to cross the bridge.

Now there are various modifications that can be made to this model. If there is a slope or curve in the bridge, cars will slow down. This can be factored in by making $\tilde{u}$ a function of position on the bridge. The ideal speed can be lower for the regions one would expect cars to go slower. For a slope, one could even include the gravitational force as the forcing term in the Navier-Stokes equation, though this would reintroduce the dependence on density.

To test this, suppose for the last two miles of the bridge there are fireworks visible over San Francisco. There may be a tendency for people to look at this and slow down. To model this, I take the ideal speed to be 65 mph for the first 2.5 miles but 45 mph for the last 2. This results in the velocity looking like the plot shown. The time of the trip increases to 5.3 minutes.

In a steady state solution like this, we probably wouldn't expect to see a sharp transition like this, as we'd expect the cars on the edge to figure it out earlier and not have to stop so abruptly. Because of this, we need a more sophisticated model of the target speed function. I would rather have it go smoothly from 65 mph to 45 mph. I used a fourth order interpolating polynomial to create a function like this between x=0 and x=2.5. Here is a plot of the more realistic model.

The commute time changes a little bit between the models. It takes about 5.8 minutes to cross the bridge like this.

If the time dependent parts were kept, there might be some sign of evolution from the first kind of plot with the sharp edge to something smoother like the one above. This could be something interesting to look at. Others have studied how traffic jams form in models like this, in particular looking at soliton solutions that the authors have coined jamitons [2]. You can see a live jamiton forming in this video!

Here is the Mathematica file I used for this: Traffic.nb

References:
1. Appert-Rolland, C., Modeling of pedestrians. arXiv:1407.4282.
2. Flynn, M.R., et al., Self-sustained nonlinear waves in traffic flow. Physical Review E, 79:056113.