The Modern Engineer: Optimization - Gradient (Steepest) Descent

Introduction

So you have an equation with 20 unknown variables. You've taken 1000 data points and need to get an equation to represent this data. Why would you need to do this?

Well it gives you a model of the data, which in turn can be used for simulations and estimations.

There are a lot of common uses for optimization. One use (in my field) is circuit optimization. When designing circuits, you sometimes need to fine tune the circuit for insertion loss or return loss. There have been times I've gotten stuck and used optimization to improve these parameters. In this article I'm going to give you a general understanding and try to provide a simple example.

General Algorithm for Gradient Descent Optimization

Gradient descent attempts to minimize a cost function by finding the gradient of the function and adjusting the variables in the opposite direction. The Gradient Descent algorithm tends to guarantee a minimum (not always, and sometimes not the global minimum).

This algorithm can be slow since you must compute a partial derivative for each unknown.

Cost Function
The first thing that needs to be created is a cost function. A cost function is function that we want to minimize.

For example, Let's say I have a circuit that is mismatched. I want to match it across a set of frequencies. Instead of doing it by hand I simply create an optimization algorithm. I want my return loss to be at least 30 dB. So the cost function would be (if Least-Squares).

$\bg_black$

where the Gamma_goal is 30 dB and Gamma_eqn is the output of our model.

The above is an example of the Least-Squares Cost Function. We're minimizing the difference between these two values. Gamma_goal is constant while Gamma_eqn is the model equation. In this case it's the Reflection coefficient.

Although the cost function can be anything, I tend to use least-squares most often. Here is a more generic form:

$S = \sum_{i=1}^{n} r_i^2$

$r_i = y_i - f(x_i,\beta)$

where beta is the parameters we're trying to find. yi and xi are the data points. In the case above, yi is the 30 dB number. I would have used there for every point.

Model Equation
The next problem is to create a model. The model must have a specific structure. It can be any equation with parameters. You can use this technique to fit lines to data as Excel does. In that case we would use an equation like so:

$\;f(x_i,\beta) = a_0 + a_1 x_i$

Using our data we would be trying to find a0 and a1. Now to find that data we must use the gradient descent algorithm.

Initial Value
Before we begin with the gradient descent, I must briefly speak about initial values. Initial values for the parameters ai must be set prior to the algorithm.

These initial values do matter (and sometimes to a large degree). This is largely because of the multiple valleys in the cost function that are possible. We desire to find the global minimum of the function but the gradient descent only locates local minima. If we're lucky it will find the global but this is not always the case.

So choose values that you believe to be close when using the gradient descent algorithm. This will increase the chances of finding the global minimum.

For instance, if we're performing line fitting, and we see the trend of the line passing through the y-axis at 20 then we should set a0 to 20.

Judgement must be used in picking initial values.

Gradient Descent Algorithm

The algorithm is based on the partial derivative of the parameters ai. Below is the equation used to determine the new values of ai:

$\;a_i^{'} = a_i - \gamma \frac{\partial S}{\partial a_i}$

The gamma is called the learning rate. The new ai' is calculated using a current ai. The learning rate should be chosen so that the jumps in the parameter is small enough not to miss the minimum and large enough so as not to take forever.

This also takes judgement but typically dynamic learning rates are used. These are learning rates that change based on the increase/decrease of the gradient.

Now the fun part...the partial derivative. In order for this algorithm to work, the partial derivative must be calculated. On a PC, an estimate works just fine for most applications. (It would defeat the purpose if we had to get an exact equation for it).

$\;\frac{\partial S}{\partial a_i} = \frac{S(a_0,a_1,...,a_i,...,a_N) - S(a_0,a_1,...,a_i + \epsilon,...,a_N)}{\epsilon}$

Epsilon is the change in the parameter to predict the limit. The closer this number to zero, the better. Of course PCs have limits but 1E-15 is usually small enough for most applications.

You must calculate all of these partial derivatives before changing the values. Never use new values of the parameters before all the derivatives have been calculated. This is a common mistake.

Once all of them are calculated, you can determine the new parameters. Then you repeat with the new values to find a closer approximation.

After so long your values will start to converge to some value. If this happens you have found a valid solution. Congrats!

There are caveats to this algorithm you must be aware of. Sometimes the learning rate is positive, sometimes negative. It's important to perform checks along the way to make sure the cost function is being reduced. If you don't do this it's possible to diverge with this algorithm as well.

Summary

The algorithm is fairly simple. Below is what must happen:

Develop a Model Equation for your data
Develop the Cost Function
Set Initial Values to the model
Determine all partial derivatives for the parameters (no parameter update)
Update parameters with the partial derivatives.
Perform any necessary checks
Repeat 4 - 6 until convergence

That should be about it. When you actually go to build this algorithm up, you will find some various things that must be done to make it efficient.

Good Luck and ask questions if you need help!

The Modern Engineer

Search This Blog

Tuesday, April 24, 2012

Optimization - Gradient (Steepest) Descent

Introduction

General Algorithm for Gradient Descent Optimization

Summary

1 comment: