# Optimization Functions in Deep Learning

This will be a short post about optimization functions in deep learning. For simplicity, I'll show how these optimization functions work with a simple 1D example. Suppose that the loss function, as a function of weight, looks like the following.

With the standard gradient descent described in a previous post, the optimization needs to entirely avoid the flat region. In this region, the gradient is zero, and thus no update to the weights can be made. As shown below, for positive starting values of the weight, it is hard to avoid the zero gradient region. Even a negative starting value can end up in the flat region if the learning rate is large enough.

*Loss as a function of optimization steps starting with a weight of -10 (top) and a weight of 10 (bottom) for different learning rates. Note that the loss flattening out is an artifact of the flat region of the loss function.*

Decaying the learning rate as the optimization proceeds doesn't really help here either. The biggest problem is that the algorithm will get stuck in the flat region, and changing the learning rate won't fix this issue.
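As a concrete sketch, here is plain gradient descent on a hypothetical stand-in loss (a quadratic bowl with an artificial flat plateau; not the exact function plotted above, and the plateau location and learning rate are my own choices for illustration):

```python
def grad(w):
    """Gradient of a stand-in 1D loss: a quadratic bowl centred at w = -5
    with an artificial zero-gradient plateau on [6, 8]."""
    if 6.0 <= w <= 8.0:
        return 0.0
    return 2.0 * (w + 5.0)

def gradient_descent(w, lr=0.1, steps=100):
    """Plain gradient descent: w only moves when the gradient is nonzero."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# A negative start converges to the minimum at w = -5 ...
print(gradient_descent(-10.0))  # ≈ -5.0
# ... but a positive start steps onto the plateau and never leaves it.
print(gradient_descent(10.0))   # stuck at w = 7.0
```

Once the iterate lands anywhere in the flat region, the gradient is exactly zero, so every subsequent update is zero regardless of the learning rate.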

**Momentum**

This example shows that the crux of the problem is that the update goes straight to zero when a flat region is encountered. The idea behind momentum is to track a *momentum* variable $v$ that updates with each step:

$$v_i \leftarrow \beta v_i + \nabla_{w_i} L$$

$$w_i \leftarrow w_i - \eta v_i$$

In these expressions, a subscript $i$ denotes the vector component to allow the expressions to generalize to multiple dimensions. There is now a new parameter, $\beta$, that controls how much memory of previous gradients the optimization process should have. Note that when $\beta = 0$ this reduces to standard gradient descent.

Implementing this on the function above, we see now that as long as the momentum parameter is large enough, the flat region can be escaped, even with the same learning rate as before. On top of that, the convergence is far faster than any of the learning-rate choices above.

*Loss as a function of optimization steps starting with a weight of 10 for different momentum parameters.*
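A minimal implementation of the momentum update, using the same hypothetical plateaued loss as a stand-in (the plateau location, learning rate, and momentum value are illustrative assumptions):

```python
def grad(w):
    """Gradient of a stand-in 1D loss: a quadratic bowl centred at w = -5
    with an artificial zero-gradient plateau on [6, 8]."""
    if 6.0 <= w <= 8.0:
        return 0.0
    return 2.0 * (w + 5.0)

def momentum_descent(w, lr=0.1, beta=0.9, steps=500):
    """Gradient descent with momentum: v remembers past gradients,
    so the update stays nonzero while crossing the flat region."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)  # beta = 0 recovers plain gradient descent
        w -= lr * v
    return w

# The accumulated velocity carries the iterate across the plateau.
print(momentum_descent(10.0))            # ≈ -5.0
print(momentum_descent(10.0, beta=0.0))  # plain GD: stuck at w = 7.0
```

The velocity built up before the plateau decays by a factor of $\beta$ per step rather than vanishing instantly, which is exactly what carries the iterate through the zero-gradient region.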

**RMSprop**

Typical deep learning models have many orders of magnitude more parameters than this 1D example (100,000+). Suppose that during the training process, the gradients start out large for some of these parameters but not others. There will be a significant update for those parameters with large gradients. The gradients of the other parameters might become larger as the training proceeds. However, if the learning rate is decaying through the training process, these updates will not change the parameters as much as if these large gradients had appeared earlier in the training process.

Thus, RMSprop attempts to tune the learning rate for each parameter. If the gradient over the previous steps has been large for that parameter, the learning rate for that parameter will be smaller. This allows for smaller steps to *refine* the value of the parameter, as discussed in a previous post. Conversely, if the gradient over the last few steps has been small, we may want a larger learning rate to be able to move more in that parameter. We don't want the memory of the gradient sizes to be *too* long since it seems perverse for the first gradients to affect the 100000th step of training. Thus, a moving average is used to keep track of the gradient magnitudes. The updates of RMSprop are the following:

$$s_i \leftarrow \gamma s_i + (1 - \gamma)\left(\nabla_{w_i} L\right)^2$$

$$w_i \leftarrow w_i - \frac{\eta}{\sqrt{s_i} + \epsilon} \nabla_{w_i} L$$

We see that $s_i$ is a moving average of the squared gradient magnitude for each parameter. The moving average is characterized by the parameter $\gamma$. $\epsilon$ is a small parameter chosen so that the denominator in the fraction does not diverge. $\eta$, as before, is the learning rate of the problem. Global learning rate decay can also be combined with this method to ensure the learning rates for each parameter are bounded.

Since the update is always directly proportional to the current gradient, RMSprop doesn't help for the problem at hand and will get stuck much like traditional gradient descent.
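A sketch of the RMSprop update on the same hypothetical plateaued loss used earlier makes this concrete: a zero gradient means a zero step, no matter how the per-parameter scale $s$ evolves (the loss, learning rate, and decay value are illustrative assumptions):

```python
import math

def grad(w):
    """Gradient of a stand-in 1D loss: a quadratic bowl centred at w = -5
    with an artificial zero-gradient plateau on [6, 8]."""
    if 6.0 <= w <= 8.0:
        return 0.0
    return 2.0 * (w + 5.0)

def rmsprop(w, lr=0.1, gamma=0.9, eps=1e-8, steps=200):
    """RMSprop: each step is scaled by a moving average
    of the squared gradient for that parameter."""
    s = 0.0
    for _ in range(steps):
        g = grad(w)
        s = gamma * s + (1.0 - gamma) * g * g
        w -= lr * g / (math.sqrt(s) + eps)  # zero gradient => zero update
    return w

# The adaptive scaling shrinks the early large steps, but the update still
# vanishes on the plateau, so RMSprop stalls just like plain gradient descent.
print(rmsprop(10.0))  # stuck inside [6, 8]
```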

**Adam**

If momentum works well, and RMSprop works well, then why not combine both? This is the idea behind Adam. The gradient descent updates of Adam are the following:

$$v_i \leftarrow \beta_1 v_i + (1 - \beta_1) \nabla_{w_i} L$$

$$s_i \leftarrow \beta_2 s_i + (1 - \beta_2)\left(\nabla_{w_i} L\right)^2$$

$$\hat{v}_i = \frac{v_i}{1 - \beta_1^t}, \qquad \hat{s}_i = \frac{s_i}{1 - \beta_2^t}$$

$$w_i \leftarrow w_i - \frac{\eta}{\sqrt{\hat{s}_i} + \epsilon} \hat{v}_i$$

The two moving averages, corresponding to the momentum and gradient magnitude, are tracked and used for the update, with $t$ denoting the step number in the bias corrections. Adam is my go-to optimizer when training deep learning models. Typically the only parameter that needs tuning is the learning rate $\eta$, and the default values of the other parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) tend to work really well. I applied Adam to the 1D example loss function with these default parameters and got some decent results. This looks a little worse than momentum, but only because this problem is so simple. For most real-world problems Adam outperforms momentum.

*Loss as a function of optimization steps using Adam optimizer with default parameters starting with a weight of 10 for different learning rates.*
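Putting the two pieces together, here is a minimal Adam sketch on the same hypothetical plateaued loss (the loss function and the learning rate, chosen a bit larger here so the plateau can be crossed, are illustrative assumptions; the other parameters are the usual defaults):

```python
import math

def grad(w):
    """Gradient of a stand-in 1D loss: a quadratic bowl centred at w = -5
    with an artificial zero-gradient plateau on [6, 8]."""
    if 6.0 <= w <= 8.0:
        return 0.0
    return 2.0 * (w + 5.0)

def adam(w, lr=0.3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: a momentum-style average v and an RMSprop-style average s,
    both bias-corrected, drive each update."""
    v = s = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        v = beta1 * v + (1.0 - beta1) * g        # momentum term
        s = beta2 * s + (1.0 - beta2) * g * g    # gradient-magnitude term
        v_hat = v / (1.0 - beta1 ** t)           # bias corrections
        s_hat = s / (1.0 - beta2 ** t)
        w -= lr * v_hat / (math.sqrt(s_hat) + eps)
    return w

# The momentum term keeps the update alive on the plateau, so Adam escapes
# the flat region and settles near the minimum at w = -5.
print(adam(10.0))
```

The momentum average supplies a nonzero update while the gradient is flat, and the magnitude average rescales each step, which is why Adam behaves like momentum on this toy problem while retaining RMSprop's per-parameter adaptivity.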