Mastering Optimization: A Guide to Calculating Derivatives in Machine Learning

Derivatives are a crucial concept in machine learning. A derivative is the mathematical tool at the heart of gradient descent: it tells us how to adjust a model's parameters to reduce the cost on every iteration. This process may seem trivial for a shallow machine learning model but can get complicated when neural networks are involved.

In this blog, we will build an intuitive understanding of the fundamental meaning of derivatives and how they are used in machine learning.

The Acceleration Analogy

Calculating derivatives may be simple, but understanding their significance and practical role can be a bit challenging. For this, I would like to use my favorite analogy for understanding derivatives.


Picture yourself behind the wheel of a sleek sports car, cruising down an open highway with an endless, straight road ahead. The thrill of the open road excites you as you gently press down on the accelerator pedal. Instantly, you feel the rush of speed as the car's velocity starts to climb. Now, here's where it gets interesting: this surge in velocity is precisely what physicists refer to as acceleration. But hold on, there's a deeper layer to this. Acceleration isn't just a speed boost; it's the rate at which your velocity changes over time. In physics, we express this relationship through derivatives. So, when we talk about acceleration, we're essentially describing how velocity changes with respect to time. In other words, acceleration is the derivative of velocity: it tells us how the function (velocity) changes with respect to an independent variable (time).

Similarly, for any mathematical function, the derivative tells us the rate of change of that function with respect to a particular variable.
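To make this concrete, here is a minimal sketch that approximates a derivative numerically with a finite difference, treating velocity as a function of time as in the analogy above. The velocity function and sample values are invented purely for illustration:

```python
def velocity(t):
    """Illustrative velocity curve: v(t) = 3t^2 + 2t (arbitrary choice)."""
    return 3 * t**2 + 2 * t

def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Analytically, v'(t) = 6t + 2, so at t = 2 the acceleration should be 14.
print(derivative(velocity, 2.0))  # ~14.0
```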

Calculating Derivatives in Machine Learning

In machine learning, one of the best-known functions is the cost function (also called the loss function). Among the different loss functions used in machine learning projects, a commonly used one is the Mean Squared Error:

\(MSE = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)
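As a quick sketch, the same formula in NumPy (the toy arrays here are made up for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# Toy example: predictions that are slightly off.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
print(mse(y_true, y_pred))  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```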

The graph of the cost function over different parameter values may look something like the following:

[Figure: a U-shaped curve of the cost function plotted against a parameter.]

The Y-axis represents the value of the cost function, whereas the X-axis represents different values of the parameter. Just as the rate of change of velocity over time gave us acceleration, here we focus on the rate of change of the cost function over different parameter values. In other words, we calculate the derivative of the cost function with respect to the parameter we want to optimize.

Taking a step further, we want to minimize the cost function. Looking at the graph, the cost function is minimized at the point where the curve bottoms out and starts increasing again. Hence, we want to find the value of the parameter at which the cost function stops decreasing, i.e., where its rate of change is equal to zero:

\(\frac{dJ}{d\theta} = 0\)

where J is the cost function (for instance, the mean squared error seen above) and \(\theta\) is the parameter we want to optimize.
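To see this condition in action, here is a small symbolic sketch using SymPy on a made-up quadratic cost (the function \((\theta - 3)^2 + 1\) is purely illustrative):

```python
import sympy as sp

theta = sp.symbols('theta')
J = (theta - 3)**2 + 1          # illustrative U-shaped cost, minimum at theta = 3

dJ = sp.diff(J, theta)          # dJ/dtheta = 2*(theta - 3)
print(sp.solve(sp.Eq(dJ, 0), theta))  # [3]: the slope is zero at the minimum
```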

Hence, taking the MSE as our cost function and, for concreteness, a simple one-parameter linear model \(\hat{y}_i = h_\theta(x_i) = \theta x_i\), calculating its derivative with respect to the parameter via the chain rule yields:

\(\frac{dJ}{d\theta} = \frac{2}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)\, x_i\)

(You may sometimes see this written without the factor of 2; that happens when the cost is defined with a \(\frac{1}{2n}\) factor so that the 2 cancels on differentiation.)
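A sketch of this gradient for the one-parameter linear model above; the data is invented so that \(\theta = 2\) is the optimum:

```python
import numpy as np

def mse_gradient(theta, x, y):
    """dJ/dtheta for J = (1/n) * sum((theta*x_i - y_i)^2)."""
    residuals = theta * x - y
    return (2 / len(x)) * np.sum(residuals * x)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # generated by y = 2x, so theta = 2 is optimal
print(mse_gradient(2.0, x, y))  # 0.0: the slope vanishes at the minimum
print(mse_gradient(1.0, x, y))  # ~-9.33: negative, since theta = 1 lies left of the minimum
```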

Remember that the derivative signifies the slope of the function at a particular point. Looking at our MSE curve above, if the point lies to the left of the minimum, the slope will be negative; if the point lies to the right, the slope will be positive. So, to reach the minimum, we update the parameter on every iteration by subtracting the slope, scaled by a small learning rate \(\alpha\):

\(\theta_{new} = \theta - \alpha \frac{dJ}{d\theta}\)

If the slope is negative, the update adds to the old parameter value, shifting it towards the minimum; if the slope is positive, the update subtracts from it, again shifting it towards the minimum.
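Putting the pieces together, here is a minimal gradient descent loop under the same assumptions as before (one-parameter linear model, invented data, and an illustrative learning rate):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # true relationship: y = 2x

theta = 0.0    # arbitrary starting point
alpha = 0.05   # learning rate (illustrative value)

for step in range(100):
    gradient = (2 / len(x)) * np.sum((theta * x - y) * x)  # dJ/dtheta
    theta = theta - alpha * gradient                       # step against the slope

print(theta)  # converges towards 2.0, the minimum of the cost
```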

Conclusion

In this blog, we built an intuitive understanding of derivatives. As a next step, proceed to gradient descent, where derivatives are used extensively to optimize parameters in machine learning. If you enjoyed this blog, please do leave a follow on my social media; it only takes a couple of seconds.

Subscribe to my monthly newsletter so that you do not miss any such interesting posts on machine learning and programming.
