Understanding Gradient Descent with an Example

“Optimization is at the heart of machine learning. It is the process of searching through a space of possible models to find one that performs well on a task. It is a challenging and sometimes frustrating process, but the rewards are great: models that can make predictions, discover patterns, and learn from data.”

Ian Goodfellow, computer scientist and deep learning researcher.

Gradient descent is one of the most fundamental and popular optimization techniques in machine learning. It is a powerful technique for updating and optimizing the parameters of a model by minimizing its cost function. During this iterative process, the parameters are adjusted in the direction that moves towards the minimum of the function curve, so the model learns from the data and improves its performance.

In this blog, we will discuss this theory of gradient descent and visualize the converging effect through an example. Moreover, we will also see the influence of the learning rate on the learning process of the model.

What is Gradient Descent?

In the context of machine learning, gradient descent allows us to minimize a cost function that summarizes the performance of the model. The cost function measures the difference between the hypothesis of the model and the ground truth in the training data. In calculus, to minimize a function, we calculate its derivative with respect to the quantity we want to optimize. This derivative of the function with respect to the optimized parameters is called a gradient. The objective of this optimization technique is to make the parameters descend along the gradient towards the minimum of the function curve, hence the name gradient descent.

[Figure: curve of a loss function (blue) and its derivative at a random point (yellow)]

The figure above illustrates the curve of a loss function (blue) and its derivative with respect to some parameter at a random point (yellow). As you can see from the figure, the magnitude of the gradients will be significantly high during the initial epochs. This is due to the random initialization of the parameters, which results in a high value of the loss function and therefore a large derivative with respect to the parameters.

The gradient shrinks over several iterations as the parameters, updated with a fixed step size (the learning rate), approach the minimum of the function. Ideally, the gradients of the best model would hit a perfect zero after learning the data. In practice, however, such a case is not desirable, as driving the training loss all the way down tends to produce a model that overfits the data. Thus, the aim is to minimize the loss function by decreasing the gradient values as low as is practical.

Furthermore, the existing parameters are updated by subtracting a fraction of these derivatives from their previous values. The equation below is the mathematical representation of this statement:

\(p_{new} = p_{prev} - \alpha \cdot \frac{df(x)}{dx}\)

Because the step size, or learning rate, has a small value (0 < step size < 1), the optimization of the parameters takes place over several iterations before the function converges to its minimum.
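The update rule above can be sketched in a few lines of Python. The toy loss \(f(x) = (x - 3)^2\), the starting point, and the learning rate below are illustrative assumptions, not taken from the post:

```python
# Minimal sketch of the update rule p_new = p_prev - alpha * df/dx,
# using the toy loss f(x) = (x - 3)**2, whose derivative is 2*(x - 3).
def f_prime(x):
    return 2 * (x - 3)

x = 0.0          # assumed starting point
alpha = 0.1      # learning rate (0 < alpha < 1)
for _ in range(100):
    x = x - alpha * f_prime(x)   # subtract a fraction of the derivative

# x has now descended close to the minimum at x = 3
```

Each iteration moves `x` a fraction of the derivative towards the minimum; because `alpha` is small, many iterations are needed before convergence.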

Visualizing the converging effect

Consider a loss function given as:

\(f(x) = \frac{1}{2}x^{T}Ax - b^{T}x\)

where,
\(A = \left[\begin{matrix}2&1\\1&20\end{matrix}\right] \text{and }b = \left[\begin{matrix}5\\3\end{matrix}\right]\)

\(x\) is our parameter of interest, and we would like to optimize it so that the loss function \(f(x)\) is minimized. As discussed above, we will calculate the gradient of the function:

\(\frac{df(x)}{dx} = A \cdot x - b\)

Now we will update the parameters by subtracting the product of the learning rate and the gradients:

\(p_{new} = p_{prev} - 0.085 \cdot \frac{df(x)}{dx}\)

The learning rate here is set to 0.085. There is no formula to determine this hyperparameter; its choice is somewhat arbitrary. The image below shows the gradients converging over 50 iterations.

[Figure: gradient values converging towards zero over 50 iterations with learning rate 0.085]

Hence, we see that updating the parameters by calculating the gradient of the loss function helps in minimizing the function, thereby optimizing the model performance.
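This example can be reproduced with a short NumPy script. The zero initialization of \(x\) is an assumption on my part; the post does not state a starting point:

```python
# Gradient descent on f(x) = (1/2)x^T A x - b^T x, whose gradient is A·x - b,
# using the learning rate 0.085 and 50 iterations from the text.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])

x = np.zeros(2)              # initial parameters (assumed)
lr = 0.085                   # learning rate from the text
for _ in range(50):
    grad = A @ x - b         # df/dx = A·x - b
    x = x - lr * grad        # p_new = p_prev - lr · gradient

# For this quadratic, the true minimum solves A·x = b, so we can
# compare against the closed-form solution.
x_star = np.linalg.solve(A, b)
```

After 50 iterations `x` agrees with the closed-form minimizer `x_star` to within a small tolerance, which is exactly the converging effect shown in the figure.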

Influence of learning rate on the gradient descent

So far, we have seen how gradient descent works and how it helps in optimizing the parameter values. In the formula above, we came across a term called the learning rate. This is the step size with which the parameters descend towards the function minimum. Choosing the correct learning rate is a matter of trial and error; there is no perfect value for it. Having said that, preferred values typically range from 0.1 down to 0.01. If the rate is set too low, the gradients take a significantly long time to descend; on the other hand, if it is too high, the updates overshoot the minimum and the gradients “bounce back”. Let us see this effect with an example.

In the example above, we used a learning rate of 0.085. For the same problem, setting the learning rate to 0.001 yields the following figure:

[Figure: gradients still far from zero after 50 iterations with learning rate 0.001]

Clearly, the model has not converged even after 50 epochs; it would take more iterations, and as a result more time, to learn the training data. On the contrary, setting the learning rate to 1 results in the following:

[Figure: gradient values growing drastically with every iteration at learning rate 1]

Surprisingly, instead of converging, the gradients increase drastically with every iteration. This issue is called exploding gradients and is a clear sign of a learning rate set too high.

Thus, the “correct” learning rate can only be determined by conducting experiments over a range of values and comparing the results.
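Such an experiment over a range of learning rates can be sketched as follows, reusing the quadratic loss from the example above (the zero initialization is again an assumption):

```python
# Compare the final gradient norm after 50 steps for three learning rates:
# too small (barely moves), well-chosen (near zero), too large (explodes).
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])

def final_grad_norm(lr, steps=50):
    """Run gradient descent and report the final gradient magnitude."""
    x = np.zeros(2)                  # assumed zero initialization
    for _ in range(steps):
        x = x - lr * (A @ x - b)     # update rule from the post
    return np.linalg.norm(A @ x - b)

norms = {lr: final_grad_norm(lr) for lr in (0.001, 0.085, 1.0)}
```

With 0.001 the gradient norm barely decreases, with 0.085 it is driven close to zero, and with 1.0 it blows up over the iterations, reproducing the three regimes shown in the figures.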

Conclusion

To conclude, gradient descent is a technique for optimizing the parameter values of a model by calculating the derivatives of the loss function. Moreover, it is important to choose an appropriate learning rate to avoid the problem of exploding gradients on one side and non-convergence of the model on the other.

In case of doubts or issues, feel free to get in touch with me on my social media and hit “Follow”:

Many such topics in machine learning may interest you. So keep in touch and receive weekly updates on the latest posts by subscribing to the MLS Newsletter:
