Entropy and least squares

While most of the papers I reference in this blog are relatively new, this article will discuss an old idea, described in some detail in [1]. Entropy has been used as a concept in information theory for over half a century. One definition of the thermodynamic entropy is

Here $k_B$ is the Boltzmann constant and $p_i$ is the probability that the system is in the $i$th microstate. The sum is over all of the microstates, which is one particular configuration for the system. For example, if I were calculating the entropy of a gas molecule in a box, a microstate would be the gas molecule with a particular position and momentum.

Since this is an equation with probability in it, it seems natural to extend the idea to other fields that deal with probability. This is the idea behind the Shannon entropy of information theory and statistics. This entropy is defined in a similar way. If one has a discrete probability distribution, $p(x_i),$ then the Shannon entropy is

Here the sum is over all possible values of $x_i.$

The second law of thermodynamics states than an isolated system will always try to maximize its entropy. Thus, if I want to determine what the equilibrium configuration of a system is, I can do this by maximizing the entropy of the system. This is the approach [2] takes in "deriving" thermodynamics from basic statistical mechanics. The entropy is maximized subject to the constraint that the distribution should have a certain energy or particle number, and the Lagrange multipliers enforcing the constraints turn out to be related to temperature and chemical potential.

A similar idea can be applied to statistics. Suppose we would like for a probability distribution to have average $\mu$ and standard deviation $\sigma,$ but have as little other information encoded as possible. In other words, given a set of data points $x_i,$ I would like to find $p(x_i),$ the assignment of probability to each of the $x_i,$ such that the average $x_i$ value will be $\mu$ and the standard deviation is $\sigma^2$. I would do this by maximizing the Shannon entropy (by setting $dS = 0$) subject to some constraints. These constrains are

By setting $dS=0,$ the form of the equation for $p(x_i)$ becomes

Here, the $\lambda_j$ are the Lagrange multipliers. I then plug this form into the constraint equations to solve for the values of $\lambda_j.$ This gives

where $Z,$ the partition function, is

So I see the partition function is a useful object in the statistical sense as well. We can't simplify this any further without knowing specific values for $x_i,$ but given this information it would be easy to solve for the values of the Lagrange multipliers.

This procedure produces some reasonable results. For example, if the set of points is the entire real axis and I would like to apply the constraints above (though I have to do things a little differently since this is a continuous case), the distribution this procedure gives turns out to be a Gaussian. Thus, the Gaussian is a distribution over the whole real line that has a set average and standard deviation but encodes as little other information as possible.

There is a notion of "relative entropy" that may not be as familiar to physicists (at least I had never heard of it). This is called the Kullback-Leibler (KL) divergence. This can be quantified as (this is actually the negative of the KL divergence)

The KL divergence of a distribution $P$ from a distribution $Q$ quantifies how much information is lost when $Q$ is used to approximate $P.$ This seems like a nice thing to consider in the context of regression. I will follow [3] to use this to show how to compare two fit models and determine which one is more robust.

Let me assume there is some true distribution $f(y)$ and I am approximating it by a function $g(y|x).$ Now consider the expected entropy in $x$ (here $E_x$ will denote the expectation value with respect to $x$). This is

Now suppose there were another model, $g'(y|x).$ I would like to consider whether $g$ or $g'$ describes $f$ better. I can look at this by looking at the difference in the expected entropy of the two.

I have made a measurement of $\ln(g(y|x))-\ln(g'(y|x))$ by performing the fit, as this is a ratio of log-likelihoods. Asymptotically, (look at [3] for details) this measurement will differ from the expected value by $2(k-k'),$ where $k$ and $k'$ are the number of parameters used in the fit of $g$ and $g',$ respectively. Correcting for this bias, the difference in entropies is

where $L$ and $L'$ are the likelihood functions of the model $g$ and $g'.$ Thus, while the logic was a bit complicated, the difference in entropies shows us an amazingly simple result! All that I need to do to compare the quality of two models is to look at the difference of twice the log-likelihood and the number of fit parameters. This is the Akaike information criterion (AIC). The AIC is an extremely useful metric to decide whether a model is being over- or under-fitted!

References:
1. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics. Physical Review 106-4, 620-630.
2. Pathria, R.K., 1972. Statistical Mechanics.
3. Akaike, H, 1985. Prediction and Entropy. A Celebration of Statistics 1, 1-24.

Ising model and voting patterns

The Ising model is a ubiquitous model in condensed matter physics as it is a simple model with nontrivial behavior. It is one of the few systems that has analytic solutions, though this only holds for the Ising model in low dimensions. Despite its simplicity, it does quite well at describing the ferromagnetic systems.

While the Ising model is typically applied to ferromagnetic spin systems, even within physics it can also be used to describe a gas that lives on a lattice. Here, I will show how the model can be extended outside of physics and describe how the Ising model relates to voting.

Assume there are $N$ people that will vote in an election. For simplicity, assume there are just two candidates in the election (or the vote is for a yes/no proposition). Let the state of voter $i$ be characterized by a variable $s_i$ that take the value of $+1$ or $-1,$ corresponding to voting for one of the two outcomes of the election. The assumption is that each voter will interact with other voters and convince them that they should vote a certain way. Let the energy associated with a voter agreeing with someone they interact with be $-\epsilon$ and the energy associated with a voter disagreeing with someone they interact with be $+\epsilon.$

For each person, there will be a propensity to vote a certain way, such as being affiliated with a political party or being subjected to advertising campaigns. Let the energy of this propensity be called $h_i$ for person number $i.$ Let $h_i$ be measured on a scale of $-\beta$ (always voting for candidate $-1$) to $+\beta$ (always voting for candidate $+1$). If a voter has a 0 value for this propensity, they are a "swing voter" and can be convinced by his or her neighbors to vote for either the $+1$ or $-1$ candidate. In a physical system, this propensity is equivalent to the existence of an external magnetic field in the spin system. Putting all this together, the total energy among the $N$ voters is

The angled brackets here mean only take the sum if $i$ and $j$ are able to interact. This is in here as clearly all $N$ voters will not talk to every one of the others unless $N$ is very small. This equation is precisely the equation for the Ising model. The equilibrium state will be one when this energy is minimized (though this minimum energy state is usually not unique). Now suppose all of the voters are on a 2D lattice (or grid), meaning that each voter interacts with 4 other voters. In the $\beta=0$ case (no political parties), then an analytic solution exists.

Configuration of voters on a 2D lattice with $\beta=0.$ Black are voters for candidate $+1$ and white are voters for candidate $-1.$ The left is the case where $\frac{\epsilon}{T} = 0.5,$ the right is where $\frac{\epsilon}{T} = 0.33.$ There is clearly clustering of similarly minded people on the left, but there is no such structure on the right. This is an example of a phase transition.

As shown above, in this case, depending on whether the "social temperature" is above or below a critical temperature, the voters will have random opinions or will form large regions of agreement. Note that as before, this temperature does not correspond to how hot it is where the voters live, but a measure of how nonideal voters are willing to be with regards to the model. If the temperature is high, random fluctuations dominate and so there is no correlation among voters. However, if the temperature is low, the voters tend to form regions where they agree with others. The critical temperature can depend on the issue at hand. I tend to agree with most of the people I associate with, so in my experience it seems that voting systems are usually in the low energy limit. However, there are some issues that are not as clear (such as GMO labeling) where the voting system can be in the high energy limit.

Now let me consider the effect of political parties. This case cannot be solved analytically, so the only way to solve the problem is numerically. The best way to do this is using the Metropolis algorithm. In this method, we first assign randomly to all voters which candidate they will vote for (give them $+1$ or $-1$). We then choose a voter at random and consider the energy change $\Delta E$ if we changed that voter's vote. If switching the vote will decrease the total energy of the system, then the voter's vote is changed. If not, the voter's vote is changed with probability $e^{-\Delta E/T},$ where $T$ is the temperature of the system. If $T=0,$ this second condition never occurs and thus a voter's opinion is changed only if it decreases the total energy. Thus, as mentioned earlier, a nonzero temperature has the effect of measuring how nonideal voters are willing to be which corresponds to increasing the energy. This probability is called a Boltzmann factor and is ubiquitous in statistical physics. After this is done, another voter is chosen randomly and the process is repeated many times. After a long time, this algorithm produces a state that is typical of an equilibrium configuration of the voters.

I assume my voters live on a 2D lattice. This means that each voter interacts with 4 other voters and no one else. One reason I choose this is that it is particularly easy to visualize. I chose the number of voters to be 62500. I repeated the Metropolis algorithm $10^7$ times, so that every voter will probably be considered by the algorithm many times. I took periodic boundary conditions such that voters on the edge interact with voters on the other edge. This insured that every voter interacted with 4 other voters.

Configuration of voters on a 2D lattice with $\frac{\epsilon}{T} = 0.5$ and $\frac{\beta}{T}=2.$ The $h_i$ values for each voter were assigned randomly between $-\beta$ and $+\beta.$ Black are voters for candidate $+1$ and white are voters for candidate $-1.$ Unlike the $\beta=0$ case, there is now less clustering of like-minded voters.

The first thing I tried is shown above. I randomly assigning values of $h_i$ between $-\beta$ and $+\beta$ to each of the voters. I still assign randomly the initial configuration of votes. This has the effect of adding more disorder to the voters, and can prevent clusters of similarly minded voters from forming, even at low temperatures.

Configuration of voters on a 2D lattice with $\frac{\beta}{T}=0.5,$ with $\beta$ distributed from negative to positive linearly from one end of the lattice to the other. Black are voters for candidate $+1$ and white are voters for candidate $-1.$ The left is the case where $\frac{\epsilon}{T} = 0.5,$ the right is where $\frac{\epsilon}{T} = 0.33.$ There is a tendency in both images for votes to cluster toward one side.

This, however, doesn't seem too realistic, as like-minded people usually associate with each other. Because of this, I tried adjusting $h_i$ such that it varies linearly from one edge of the lattice to the other. This has the effect of encouraging order, and predictably groups similarly minded people (when $\beta=0,$ the choice of grouping is random and is an example of spontaneous symmetry breaking).

When the choice of $h_i$ is such that positive and negative values are equally likely, there is no clear winner in the election (the difference in $+1$ and $-1$ votes is usually very close to 0). However, if there is an asymmetry between the number of people who can have negative $h_i$ values and positive $h_i$ values, this is usually a good indicator of who will win the election. This asymmetry could be due to something as simple as one campaign having better ads than the others and getting the swing voters.

One extension that could be made is to have different weights for certain interactions. Reading a distant friend's post on Facebook is probably less relevant to you than a discussion with a close family member. To account for this, the value of $\epsilon$ could be adjusted such that it takes on different values for different pairs of interactions. Another possible extension is to consider different interaction schemes. It seems unlikely that every voter would have only four people they interact with. We could do something more realistic such as having a range of a number of interactions each voter can have. This is a little more complicated to program, however, as a different data structure (a graph) has to be used.

You can see the code I used here.

References:
1. Schroeder, Daniel. 2000. An Introduction to Thermal Physics.

Random numbers from a distribution and the Lambert W function

When I was making Physibounce, one of the things that took me a while to figure out is something that is unfortunately barely noticeable in gameplay. When the speed of light becomes small, I wanted the speed distribution of particles to be a 2D Maxwell-Juttner distribution rather than the non-relativistic Maxwell speed distribution. The probability density function for particles allowed to be close to the speed of light is:

Where $v$ is the velocity of the particle, $\gamma(v) = (1-\frac{v^2}{c^2})^{-1/2}$ is the Lorentz factor, $k_B$ is the Boltzmann constant, $m$ is the mass of the mass of the particle, $T$ is the temperature of the system, and $c$ is the speed of light. This distribution will look like a Maxwell distribution in the $c\to\infty$ limit, and is normalized so that $\int_0^c f(v) dv = 1.$

The simplest way to generate random numbers following a distribution is the acceptance-rejection method. This is what I do in Physibounce when the speed of light is not changed. However, I was not able to find an expression for a suitable (and efficient) bound (or maximum value) for the distribution given above. Because of this, I decided to use inverse sampling instead. To do this, I first find the cumulative distribution function as

Amazingly, Mathematica was able to do this analytically. By the definition, it is clear $F$ takes on values in the interval $[0,1].$ Thus, if I find a relation that expressed $v$ in terms of $F$ (equivalent to inverting the function), I have an expression that takes uniform values on the interval $[0,1]$ and maps them to a velocity in the interval $[0,c]$ that follows the desired distribution. Mathematica was also able to do the inversion, but the result was in terms of the Lambert W function.

The Lambert W function is the function that satisfies

So it seems reasonable something like this would come up when inverting the cumulative distribution function. An important thing about this function is that it is double valued! There are two branches. When I asked Mathematica to solve the problem, it chose a branch and the branch it chose turned out to be the wrong one. I wondered for a while why I was not getting reasonable velocity values when I plugged in random numbers, and after a bit I realized I needed to choose the other branch.

The next problem I ran into was that I did know of a way to actually evaluate the Lambert W function in actionscript. So I turned to the asymptotic expansions. It turned out I was in a regime where the argument of the Lambert W tended to be large. I took the first few terms in the expansion about $\infty$ for the Lambert W. I verified that over the parameter ranges I could expect in Physibounce that the difference between the Lambert W and the expansion was less than 1%. Since it's pretty hard to look at the speeds of a bunch of objects and determine the distribution of their speeds, I thought this was more than good enough, though I did include a check to make sure no particles were created with speed greater than $c.$ In the end the code ended up only being a few lines, but quite a bit of thought went into each of the lines!

See the ActionScript code here.

Being Cautious of Mathematical Models

I recently read [1], which tries to create difficult Sudoku puzzles using mathematical modeling and physics. I found it interesting that while the arguments in the paper are sound, the "extremely challenging" puzzle they create is only somewhat challenging, and is much easier than the paper's examples of hard puzzles. So, what went wrong?

I am guessing the key issue was that only two methods for solving Sudoku puzzles were considered. While the two considered are common methods for solving the puzzles, there are other ways. In fact, the "extremely challenging" puzzle produced in the paper was only challenging if one stuck to the two methods, and was easier using other methods. While it is impossible for a mathematical model to consider every possible thought process that a human could make, including more of them can only make the results better. Modern galactic simulations require a handful of physics models including radiation, hydrodynamics, and gravity [2]. Only with all these different models are they able to make accurate predictions.

Does this mean that all models that don't take into account dozens of factors shouldn't be believed? Not at all. There can be great, accurate models that still "ignore" some of reality. A good example of this is Newtonian physics. When describing most objects we encounter in everyday life, we see that Newtonian physics works well. However, if we want to study particles moving close to the speed of light (such as a muon created in the atmosphere), Newtonian physics gives the wrong result and special relativity should be used. Does this mean learning Newtonian physics as a freshman in college is useless? Not at all, a lot of the world can be described by Newtonian physics with extremely small errors. To use special relativity in constructing a bridge would be overkill, and would increase the amount of work the engineers have to do (and if the engineers were smart, they would take the $c\to\infty$ limit and get Newtonian physics out anyways).

So, I think the bottom line in whether or not to accept a certain model is to ask whether various extensions could have a significant impact on the results. This may or may not be obvious. In [1], the authors believed that considering two solving methods would be sufficient to create difficult puzzles, but it turned out that adding more solving methods will probably change the results quite a bit. In physics we're usually pretty aware when our mathematical models break down (though some models like the standard model of particle physics work much better than we expect it to). However, if we're extending ideas of physics to other situations (such as the Sudoku difficulty models or the seating models discussed previously), it is important to be careful and one should reason or check that the models won't be affected drastically by extensions.

References:
1. Watanabe, H., Difficult Sudoku Puzzles Created by Replica Exchange Monte Carlo Method. arXiv:1303.1886.
2. Hayes, J.C., et al., Simulating radiating and magnetized flows in multiple dimensions with ZEUS-MP. The Astrophysical Journal Supplement Series, 165:188–228.

Modeling Seating Using Statistical Mechanics

When I ride buses and trains I always find it fascinating that there are always people who stand before all of the seats are taken (I sometimes follow this pattern as well). I figured this could be modeled as some finite-temperature effect in a statistical mechanics system. This seemed appropriate as the process of seating reminded me of adsorption of a molecule onto a surface. This led me to [1], where free-for-all seating on an airplane is modeled. Here, I will talk about bus seating in a similar language.

The first assumption I make is that people are ideal non-interacting fermions. This means that two people cannot sit in the same seat or stand in the same space and that people have no free will and will just follow the laws of physics. An interaction can be something like a patron avoiding sitting next to someone who has not showered or a group of friends wanting to sit together. These can clearly happen, but I assume that in the collective behavior of everyone in the bus, these small effects are negligible. In fact, interactions would not be too hard to incorporate into the model (though I have no idea how to incorporate free will). These may seem like a bad assumptions, but for the collective behavior of everyone in the bus, what an individual thinks should not be too relevant. After all, all I really want to know is how people will distribute themselves in a bus, not whether a particular person will sit somewhere.

Fig. 1: The bus model used. The blue are seats and orange are standing positions. I assume the maximum capacity of the bus is 64, which roughly agrees with some of the information I gathered from transit sites [2]. The seating arrangement is a typical one I have seen on a bus.

I will split the bus up into a lattice, with different spacings for seats and standers (with seaters taking more space). This is shown in the figure above. Now, assume the total capacity of the bus is $N$ people and there are $n$ people on a bus and these numbers are fixed (the bus is currently between stops). Since I am modeling the people as a fermion gas, the occupancy (the expected number of people) of a particular seat or a particular place to stand (labeled by index $i$) with energy $\epsilon_i$ is given by:

Note that in the statistical mechanics result, the temperature, $T,$ is usually multiplied by Boltzmann's constant. Since this is not a true statistical mechanics system, Boltzmann's constant is not applicable, so I will assume temperature is measured in units where Boltzmann's constant is 1. Note that the temperature here has no relation to how hot it is inside or outside the bus. In fact, the temperature of the system may be lower if it is hotter outside, since this might mean more people feel tired and would like to sit down. The temperature characterizes how likely it is that a person will choose a "non-optimal" seat. If the temperature were 0, the seats would all be filled in order of lowest energy. A finite (nonzero) temperature will allow for someone to choose to stand instead of sit, for example.

The energy, $\epsilon_i,$ is roughly how undesirable or uncomfortable a seat is. Thus, a seat will have a lower energy whereas the standing positions have a higher energy.

$\mu$ is the chemical potential. This is roughly the most likely energy a seat or space that will be occupied next will have. It can be solved for by the relation:

This can easily be solved numerically. I used the secant method to do it.

I will use a simple model for the energies. I will assume that a sitting spot has some negative energy $-\alpha.$ For the purposes of this study, I assume $\alpha$ is the same for every seat, but it would be trivial to generalize this so that it is different for each seat. I will assume that a standing spot has no energy (Note that, this could be changed so that seats have no energy and that standing has some energy $\alpha$ and the results would be unchanged). On buses I have also noticed an effect where there is an aversion to standing behind the back door. To model this, I will assume all seats in the back of the bus have an energy $\beta$ added to them. This could easily be generalized such that there is an aversion to being far from a door, but I will keep this model simple.

If the temperature is too high (relative to $\alpha$ and $\beta$) then the seats and standing spots look equivalent and all places have a roughly equal probability of being occupied. This is not the expectation, so I assume that I am not in this regime. Further, if the temperature is too low (again relative to $\alpha$ and $\beta$), the seats will all fill before anyone stands. Again, this is not what I observe so I expect that the temperature is not in this regime. Thus, I look at cases when $\alpha$ and $\beta$ are within an order of magnitude or two of the temperature. Empirically, I found that $\alpha/T$ should between 0.5-20. I also assume $\beta$ is less than $\alpha$ as sitting in the back should be more desirable than standing (which, you will recall, has a zero energy).

Fig. 2: Typical bus configurations for various values of $\alpha$ and $\beta$ for $n=35.$ The people are black dots. Again, blue are seats and orange are standing positions. Note there is no absolute configuration of people since statistical mechanics is probabilistic in nature.

The figure above shows representative arrangements of people for various values of $\alpha/T$ and $\beta/T.$ Multiple values of these can produce similar results. It seems that $\beta/T$ should be within a factor of 2 of $\alpha/T$ to produce "realistic" results, though this would be more formalized with real observations.

Here, I just wanted to illustrate how physics can be used to model people so I have only considered a really basic model. One would probably not want to draw many conclusions from this, but it can be extended. There are certainly more desirable seats (and standing spots) than others, so the energies can be altered to reflect this instead of having the same energy for all seats. Another interesting thing could also consider how the bus fills instead of looking at the final state. With enough data, the arrangement of seats on a bus could probably be optimized using a model like this such that everyone is most comfortable.

Of course, this kind of reasoning can apply to other scenarios as well. I was once in an airport waiting room and noticed that there were many people standing although there were available seats. A similar type of reasoning may reveal better ways to arrange the chairs. I have also noticed that in lecture halls people tend to sit on the edge making it hard to get to the middle seat. This could again be modeled similarly and perhaps lecture hall designs could be optimized. I hope to make another blog entries along these lines at some time.

If you'd like to play around with this yourself, you can modify the python code I wrote. Let me know if you discover anything interesting!

References:
1. Steffen, J. H., 2008b. A statistical mechanics model for free-for-all airplane passenger boarding. American Journal of Physics 76, 1114–1119.