Stochastic Gradient Descent

Stochastic gradient descent uses iterative calculations to find a minimum or maximum in a multi-dimensional space. In the context of machine learning, the words Stochastic Gradient Descent (SGD) mean:

  • Stochastic: random processes are used

  • Gradient: a derivative-based measure of how a function's output value changes

  • Descent: moving from greater to lesser values of a function's output value

Derivatives are key to making stochastic gradient descent work, so let's take a look at them first before tackling SGD. Derivatives are shown graphically in the illustration below, where:

the curved blue line shows the function:

the derivative of this function is:

The slope of a curve's tangent line at a point equals the derivative at that point. The two straight lines show the tangents of f(x) at two points:

red line - shows the tangent for x=2:

green line - shows the tangent for x=1:
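
As a hedged, illustrative stand-in (the actual function in the figure may differ), suppose the blue curve is f(x) = x². Then the derivative and the two tangent slopes would be:

    f(x)  = x^2
    f'(x) = 2x
    f'(2) = 4    (slope of the red tangent at x = 2)
    f'(1) = 2    (slope of the green tangent at x = 1)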

Gradients provide two key pieces of information that are essential for performing SGD (a short sketch follows the list):

  • The direction of descent - in the graph above, moving left descends the curve

  • The size of descent steps - when the derivative (slope) is larger, steps can be larger
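
Continuing the illustrative assumption f(x) = x² with f'(x) = 2x, a small Python sketch makes both points concrete: the sign of the slope gives the descent direction, and its magnitude suggests how large a step we can take.

    # Illustrative sketch only, assuming f(x) = x^2 and f'(x) = 2x.
    for x in (2.0, 1.0, 0.1):
        slope = 2 * x                                  # derivative at x
        direction = "left" if slope > 0 else "right"   # descend against the slope
        print(f"x = {x}: slope = {slope}, descend to the {direction}, "
              f"step scales with |slope| = {abs(slope)}")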

Now let's take a look at a formula for using SGD to find a minimum of our function above. The formula iteratively subtracts a step from x to arrive at the estimated minimum:

using:

the SGD equation is:

In the graph illustration above:

Using SGD, we iteratively reduce the value of x, using the formula above, to find the minimum value of y. The multiplier used in the x-update equation is called the learning rate:
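
The standard form of this update rule, where eta (η) is the learning rate and f'(x_n) is the derivative at the current point (the notation in the original figures may differ), is:

    x_{n+1} = x_n - η · f'(x_n)

Because the step is proportional to the slope f'(x_n), the steps automatically shrink as the curve flattens near the minimum.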

There are a number of approaches to adapting the learning rate and the update rule. These include the following (a minimal momentum sketch appears after the list):

  • Momentum - remembers the update at each iteration and determines the next update as a linear combination of the gradient and the previous update

  • Averaging - records an average of its parameter vector over time

  • AdaGrad - adapts the learning rate per parameter, increasing the effective learning rate for sparser parameters and decreasing it for less sparse ones

  • RMSProp - (for Root Mean Square Propagation) is a method in which the learning rate is adapted for each of the parameters - the idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight

  • Adam - (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer - in this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used
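
As a minimal, hedged Python sketch of the momentum idea (not tied to any particular library; the learning rate lr and decay factor beta below are illustrative values), the update keeps a running "velocity" that combines the current gradient with the previous update:

    # Minimal momentum sketch; lr and beta are illustrative, assumed values.
    def momentum_step(x, velocity, grad, lr=0.1, beta=0.9):
        velocity = beta * velocity - lr * grad   # combine previous update and current gradient
        return x + velocity, velocity            # apply the update to the parameter

    # Illustration with the assumed stand-in f(x) = x^2, so grad = 2 * x.
    x, v = 2.0, 0.0
    for _ in range(100):
        x, v = momentum_step(x, v, grad=2 * x)
    print(x)   # ends up close to the minimum at x = 0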

So as we reduce the value of x, the value of y will also decrease. Because we want to find the minimum value of y, we want to stop iterating the reduction of x when y reaches that minimum. In our example, we can look at the graph and see that the minimum is at y=0. But in actual SGD applications, the minimum value of y is unknown. We won't be able to determine the exact minimum, so we'll want to stop iterating when the reductions in x get very small.

Below are the results of running a number of iterations of the x and y calculations, based on the values shown:

The (x,y) pair values are plotted in the graph below. Notice that as the value of y approaches 0, the descent slows:
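
A minimal Python sketch of this kind of iteration, again assuming the stand-in f(x) = x² with an illustrative learning rate of 0.1, prints the (x, y) pairs and stops once the change in x becomes very small:

    # Sketch only: f(x) = x^2 is an assumed stand-in for the function above.
    def f(x):
        return x ** 2

    def f_prime(x):
        return 2 * x

    x = 2.0
    learning_rate = 0.1
    tolerance = 1e-4

    for i in range(1000):
        step = learning_rate * f_prime(x)
        x -= step
        print(f"iteration {i + 1}: x = {x:.5f}, y = {f(x):.5f}")
        if abs(step) < tolerance:   # stop when the reduction in x gets very small
            break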

SGD will typically be used to find the minimum of a function of multiple variables, not just the single variable used in our example above. Also, the dependent variable on the vertical axis is typically referred to as a cost, or loss, function.

The illustration below shows a cost/loss function J with two independent variables:

Moving down the gradient descent curve might look something like this:

https://www.youtube.com/watch?v=5u4G23_OohI

Let's say we have a cost/loss function J that is similar to the single-variable function we explored, but with two variables:

We would want to find the minimum of this function, which can be expressed as:
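
In conventional notation (using θ1 and θ2 as illustrative parameter names, since the exact symbols may differ), this goal is written:

    min_{θ1, θ2} J(θ1, θ2)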

In our simple example above, we used the derivative of our function as part of the calculation to modify the variable x. Because we're now dealing with two variables, we would use partial derivatives with respect to each variable.
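
As a hedged Python sketch, assume the simple stand-in J(θ1, θ2) = θ1² + θ2² (the actual cost function may differ); each variable is then updated by subtracting the learning rate times its own partial derivative:

    # Two-variable gradient descent sketch; J(t1, t2) = t1^2 + t2^2 is an assumed stand-in.
    def J(t1, t2):
        return t1 ** 2 + t2 ** 2

    def dJ_dt1(t1, t2):
        return 2 * t1              # partial derivative of J with respect to t1

    def dJ_dt2(t1, t2):
        return 2 * t2              # partial derivative of J with respect to t2

    t1, t2 = 3.0, -2.0
    learning_rate = 0.1
    for _ in range(1000):
        step1 = learning_rate * dJ_dt1(t1, t2)
        step2 = learning_rate * dJ_dt2(t1, t2)
        t1 -= step1
        t2 -= step2
        if max(abs(step1), abs(step2)) < 1e-6:   # stop when the updates get very small
            break

    print(t1, t2, J(t1, t2))   # both parameters end up close to the minimum at (0, 0)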

References