# Entropy and least squares

While most of the papers I reference in this blog are relatively new, this article will discuss an old idea, described in some detail in [1]. Entropy has been used as a concept in information theory for over half a century. One definition of the thermodynamic entropy is

Here is the Boltzmann constant and is the probability that the system is in the th microstate. The sum is over all of the microstates, which is one particular configuration for the system. For example, if I were calculating the entropy of a gas molecule in a box, a microstate would be the gas molecule with a particular position and momentum.

Since this is an equation with probability in it, it seems natural to extend the idea to other fields that deal with probability. This is the idea behind the Shannon entropy of information theory and statistics. This entropy is defined in a similar way. If one has a discrete probability distribution, then the Shannon entropy is

Here the sum is over all possible values of

The second law of thermodynamics states than an isolated system will always try to maximize its entropy. Thus, if I want to determine what the equilibrium configuration of a system is, I can do this by maximizing the entropy of the system. This is the approach [2] takes in "deriving" thermodynamics from basic statistical mechanics. The entropy is maximized subject to the constraint that the distribution should have a certain energy or particle number, and the Lagrange multipliers enforcing the constraints turn out to be related to temperature and chemical potential.

A similar idea can be applied to statistics. Suppose we would like for a probability distribution to have average and standard deviation but have as little other information encoded as possible. In other words, given a set of data points I would like to find the assignment of probability to each of the such that the average value will be and the standard deviation is . I would do this by maximizing the Shannon entropy (by setting ) subject to some constraints. These constrains are

= \sum_i p(x_i) x_i = \mu" />

= \sum_i p(x_i) x_i^2 = \sigma^2 + \mu^2." />

By setting the form of the equation for becomes

Here, the are the Lagrange multipliers. I then plug this form into the constraint equations to solve for the values of This gives

where the partition function, is

So I see the partition function is a useful object in the statistical sense as well. We can't simplify this any further without knowing specific values for but given this information it would be easy to solve for the values of the Lagrange multipliers.

This procedure produces some reasonable results. For example, if the set of points is the entire real axis and I would like to apply the constraints above (though I have to do things a little differently since this is a continuous case), the distribution this procedure gives turns out to be a Gaussian. Thus, the Gaussian is a distribution over the whole real line that has a set average and standard deviation but encodes as little other information as possible.

There is a notion of "relative entropy" that may not be as familiar to physicists (at least I had never heard of it). This is called the Kullback-Leibler (KL) divergence. This can be quantified as (this is actually the negative of the KL divergence)

The KL divergence of a distribution from a distribution quantifies how much information is lost when is used to approximate This seems like a nice thing to consider in the context of regression. I will follow [3] to use this to show how to compare two fit models and determine which one is more robust.

Let me assume there is some true distribution and I am approximating it by a function Now consider the expected entropy in (here will denote the expectation value with respect to ). This is

Now suppose there were another model, I would like to consider whether or describes better. I can look at this by looking at the difference in the expected entropy of the two.

I have made a measurement of by performing the fit, as this is a ratio of log-likelihoods. Asymptotically, (look at [3] for details) this measurement will differ from the expected value by where and are the number of parameters used in the fit of and respectively. Correcting for this bias, the difference in entropies is

where and are the likelihood functions of the model and Thus, while the logic was a bit complicated, the difference in entropies shows us an amazingly simple result! All that I need to do to compare the quality of two models is to look at the difference of twice the log-likelihood and the number of fit parameters. This is the Akaike information criterion (AIC). The AIC is an extremely useful metric to decide whether a model is being over- or under-fitted!

References:

1. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics. Physical Review 106-4, 620-630.

2. Pathria, R.K., 1972. Statistical Mechanics.

3. Akaike, H, 1985. Prediction and Entropy. A Celebration of Statistics 1, 1-24.