Entropy and least squares

While most of the papers I reference in this blog are relatively new, this article will discuss an old idea, described in some detail in [1]. Entropy has been used as a concept in information theory for over half a century. One definition of the thermodynamic entropy is

 S = - k_B \sum_i p_i \ln(p_i).

Here k_B is the Boltzmann constant and p_i is the probability that the system is in the ith microstate. The sum is over all of the microstates, which is one particular configuration for the system. For example, if I were calculating the entropy of a gas molecule in a box, a microstate would be the gas molecule with a particular position and momentum.

Since this is an equation with probability in it, it seems natural to extend the idea to other fields that deal with probability. This is the idea behind the Shannon entropy of information theory and statistics. This entropy is defined in a similar way. If one has a discrete probability distribution, p(x_i), then the Shannon entropy is

 S = - \sum_i p(x_i) \log_2(p(x_i)).

Here the sum is over all possible values of x_i.

The second law of thermodynamics states than an isolated system will always try to maximize its entropy. Thus, if I want to determine what the equilibrium configuration of a system is, I can do this by maximizing the entropy of the system. This is the approach [2] takes in "deriving" thermodynamics from basic statistical mechanics. The entropy is maximized subject to the constraint that the distribution should have a certain energy or particle number, and the Lagrange multipliers enforcing the constraints turn out to be related to temperature and chemical potential.

A similar idea can be applied to statistics. Suppose we would like for a probability distribution to have average \mu and standard deviation \sigma, but have as little other information encoded as possible. In other words, given a set of data points x_i, I would like to find p(x_i), the assignment of probability to each of the x_i, such that the average x_i value will be \mu and the standard deviation is \sigma^2. I would do this by maximizing the Shannon entropy (by setting dS = 0) subject to some constraints. These constrains are

 \sum_i p(x_i) = 1

 <x data-recalc-dims= = \sum_i p(x_i) x_i = \mu" />

 <x^2 data-recalc-dims= = \sum_i p(x_i) x_i^2 = \sigma^2 + \mu^2." />

By setting dS=0, the form of the equation for p(x_i) becomes

 p(x_i) = e^{-\lambda_1-\lambda_2 x_i-\lambda_3 x_i^2}.

Here, the \lambda_j are the Lagrange multipliers. I then plug this form into the constraint equations to solve for the values of \lambda_j. This gives

 \lambda_1 = \ln(Z)

 \mu = -\frac{\partial}{\partial \lambda_2} \ln(Z)

 \sigma^2+\mu^2 = -\frac{\partial}{\partial \lambda_3} \ln(Z)

where Z, the partition function, is

 Z = \sum_i e^{-\lambda_1-\lambda_2 x_i-\lambda_3 x_i^2}.

So I see the partition function is a useful object in the statistical sense as well. We can't simplify this any further without knowing specific values for x_i, but given this information it would be easy to solve for the values of the Lagrange multipliers.

This procedure produces some reasonable results. For example, if the set of points is the entire real axis and I would like to apply the constraints above (though I have to do things a little differently since this is a continuous case), the distribution this procedure gives turns out to be a Gaussian. Thus, the Gaussian is a distribution over the whole real line that has a set average and standard deviation but encodes as little other information as possible.

There is a notion of "relative entropy" that may not be as familiar to physicists (at least I had never heard of it). This is called the Kullback-Leibler (KL) divergence. This can be quantified as (this is actually the negative of the KL divergence)

 B(P;Q) = -\sum_i P(x_i) \ln\left(\frac{P(x_i)}{Q(x_i)}\right).

The KL divergence of a distribution P from a distribution Q quantifies how much information is lost when Q is used to approximate P. This seems like a nice thing to consider in the context of regression. I will follow [3] to use this to show how to compare two fit models and determine which one is more robust.

Let me assume there is some true distribution f(y) and I am approximating it by a function g(y|x). Now consider the expected entropy in x (here E_x will denote the expectation value with respect to x). This is

 E_x B(f;g(|x)) = - E_y \ln(f(y)) + E_xE_y\ln(g(y|x)).

Now suppose there were another model, g'(y|x). I would like to consider whether g or g' describes f better. I can look at this by looking at the difference in the expected entropy of the two.

 E_x B(f;g(|x))-E_x B(f;g'(|x)) = E_x(E_y\ln(g(y|x))-E_y\ln(g'(y|x)))

I have made a measurement of \ln(g(y|x))-\ln(g'(y|x)) by performing the fit, as this is a ratio of log-likelihoods. Asymptotically, (look at [3] for details) this measurement will differ from the expected value by 2(k-k'), where k and k' are the number of parameters used in the fit of g and g', respectively. Correcting for this bias, the difference in entropies is

  E_x B(f;g(|x))-E_x B(f;g'(|x)) = 2\ln(L)-2k-(2\ln(L')-2k')

where L and L' are the likelihood functions of the model g and g'. Thus, while the logic was a bit complicated, the difference in entropies shows us an amazingly simple result! All that I need to do to compare the quality of two models is to look at the difference of twice the log-likelihood and the number of fit parameters. This is the Akaike information criterion (AIC). The AIC is an extremely useful metric to decide whether a model is being over- or under-fitted!

1. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics. Physical Review 106-4, 620-630.
2. Pathria, R.K., 1972. Statistical Mechanics.
3. Akaike, H, 1985. Prediction and Entropy. A Celebration of Statistics 1, 1-24.