Choosing the Right Gradient Descent: Batch vs Stochastic vs Mini-Batch Explained

In my previous post on gradient descent, I briefly explained what gradient descent means and the mathematical idea behind it. A basic gradient descent algorithm calculates derivatives of the cost function with respect to the parameters being optimized, and this derivative is computed over the entire training set as a whole. Now if the data has hundreds of thousands of samples, the gradient descent algorithm will be cooked! It will take a significant amount of time to train the model on such a large dataset.

With progress in machine learning, modified gradient descent algorithms were developed to tackle exactly this problem of huge training sets, and that is what we are going to see in this blog. We will cover the three popular types of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. So let’s get started!

What was Gradient Descent again?

To all those readers who ignored the linked blog above on gradient descent, let me provide a brief introduction to the topic. Imagine a big, empty bowl in which you roll a small ball. The ball will start rolling from one side, climb up the other, and roll back. This will repeat until it finds that one sweet spot where it can relax and be stable. This is what simple gradient descent does, where the “bowl” is the domain of the cost function, the “ball” is the value of the error function, and the “stable spot” is the local minimum (global if you are luckier than George Russell in the 2024 Austrian GP!)

When the parameters (or weights) are initially chosen at random, the hypothesis will deviate massively from the ground truth. Hence, the parameters should be adjusted so that the hypothesis matches the ground truth as closely as possible; in other words, so that the value of the cost function is as small as possible. This “adjustment” takes place by calculating the derivative of the cost function w.r.t. each parameter, multiplying it by a scaling factor (called the learning rate), and subtracting the result from the previous parameter value. That’s it! This is how basic gradient descent works.

Assuming that the cost function in question is the mean squared error, the gradient descent step can be described mathematically as:

\(\text{gradients} = \frac{2}{m} X^T (X\theta - y)\),

where,
X is the input data,
θ are the model parameters,
y is the target output,
m is the number of samples.

The gradients quantify the slope of the cost function at the current theta values. To drive the slope to zero (i.e., to reach the function’s minimum), the theta values are updated by subtracting the gradients from the current values, scaled by a factor called the learning rate:

\(\theta_{new} = \theta_{old} - \alpha \cdot \text{gradients}\)

where α is the learning rate, a hyperparameter that controls how large a step is taken in the direction of the gradients. It helps the model converge smoothly without overshooting.
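Translated into numpy, the two formulas above become a few lines of code. Here is a minimal sketch of a single update step on a toy dataset (the variable names are illustrative, and X is assumed to already include a bias column of ones):

import numpy as np

# Toy data: X includes a bias column of ones; the true relation is y = 2x + 1
X = np.c_[np.ones((5, 1)), np.arange(5).reshape(5, 1)]
y = 2 * X[:, [1]] + 1
theta = np.random.randn(2, 1)  # random initial parameters
m = X.shape[0]                 # number of samples
alpha = 0.1                    # learning rate

# One gradient descent step, exactly as in the two formulas above
gradients = 2/m * X.T.dot(X.dot(theta) - y)
theta = theta - alpha * gradients

Repeating these last two lines many times is all there is to the training loop, as we will see next.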

This is how a basic gradient descent algorithm works and helps a model learn from the data. But maybe the researchers felt bored, thought “Hey, let’s have some fun and create modified versions of gradient descent!”, and boom! We now have batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let us have a look at them.

Batch Gradient Descent

Batch Gradient Descent, or BGD, is the simplest type of gradient descent, where the gradients are calculated over the entire training data. The formula seen above can be considered that of batch gradient descent, where X is a matrix containing all the training samples. This characteristic makes batch gradient descent efficient on data with a large number of features, but slow on data with a large number of samples.

Let us consider a small and simple example to understand batch gradient descent.

import numpy as np
import matplotlib.pyplot as plt

def add_noise(x):
    # Add random integer noise in [-3, 2] to a sample
    return x + np.random.randint(-3, 3)

# 100 evenly spaced points in [0, 10], shaped as a column vector
X = np.linspace(0, 10, 100).reshape(100, -1)
y = np.zeros_like(X)

# Targets: y = x plus noise
for i in range(X.shape[0]):
    y[i] = add_noise(X[i])

plt.scatter(X, y)
plt.xlabel('X', fontweight='bold')
plt.ylabel('y', fontweight='bold')
plt.title('Training Data', fontweight='bold')
plt.show()
Scatter plot of the training data

We will now train the parameters theta for 1000 iterations over the entire data with a learning rate of 0.01. The cost will be monitored at every iteration.

l_r = 0.01           # learning rate
n_iterations = 1000
m = 100              # number of training samples
theta = np.random.randn(2, 1)  # random initialization of the parameters

# Prepend a bias column of ones to X
X_ = np.c_[np.ones((100, 1)), X]

for iteration in range(n_iterations):
    cost = 1/m * np.sum((X_.dot(theta) - y)**2)    # mean squared error
    gradients = 2/m * X_.T.dot(X_.dot(theta) - y)  # gradients over the full batch
    theta -= l_r * gradients                       # update step
    print(cost)
Data points with the optimum fitting line

Gradients stabilize at the minimum in BGD

As seen in the image above, one thing to note about batch gradient descent is that, because the gradients are calculated over the entire batch, they stabilize once they reach the minimum of the function. This is unlike stochastic gradient descent, which we will see next.

Stochastic Gradient Descent

In contrast to batch gradient descent, stochastic gradient descent picks a random sample from the data and calculates the gradient based on this single instance. This process is repeated over several iterations, each time with a new randomly selected sample from the dataset. Because only one instance needs to be in memory at a time, as opposed to the entire batch seen previously, SGD takes relatively less time to compute the gradients on an enormous training set.

The problem with this stochastic nature, however, is that some samples may be picked repeatedly when calculating the gradients while others may not be touched at all. One solution is to reshuffle the training data in every iteration so that each sample is visited once per pass (a sketch of this fix follows the SGD example below). This, in turn, somewhat negates SGD’s speed advantage, as it makes the model converge more slowly. Moreover, the gradient calculated on one sample may not be appropriate for the next one. This leads to sudden bounces in the cost function and ultimately makes the model unstable during training. This can be seen in the image below.

Unstable behavior of gradients in stochastic gradient descent

Here’s the code to yield the above image:

from IPython.display import clear_output  # needed for the live plot in Jupyter

l_r = 0.01        # learning rate
n_epochs = 100
m = 100           # number of training samples
theta = np.random.randn(2, 1)
gradients = np.zeros((n_epochs, 2))  # last gradient of each epoch, for plotting
X_ = np.c_[np.ones((100, 1)), X]

for epoch in range(n_epochs):
    for sample in range(m):
        # Pick one random sample and compute the gradient on it alone
        random_int = np.random.randint(0, m)
        x_sample = X_[random_int, :].reshape(1, -1)
        y_sample = y[random_int].reshape(1, -1)
        gradient = 2 * x_sample.T.dot(x_sample.dot(theta) - y_sample)
        gradients[epoch] = gradient.ravel()  # keep the last gradient of this epoch
        theta -= l_r * gradient

    # Redraw the gradient trajectory after every epoch
    clear_output(wait=True)
    plt.plot(gradients[:epoch+1, 0], gradients[:epoch+1, 1], 'r-')
    plt.scatter(gradients[:epoch+1, 0], gradients[:epoch+1, 1])
    plt.xlabel('theta_1', fontweight='bold')
    plt.ylabel('theta_2', fontweight='bold')
    plt.title('Stochastic Gradient Descent', fontweight='bold')
    plt.show()
    cost = 1/m * np.sum((X_.dot(theta) - y)**2)

The resulting model is good enough, but not optimal. Of course, this very characteristic can turn out to be advantageous when the cost function is irregular and non-convex, since the random jumps can help the parameters escape a local minimum instead of settling in it.
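As a side note, here is one way the shuffling fix mentioned earlier could look: draw a fresh random permutation of the sample indices at the start of every epoch, so that each sample is visited exactly once per pass. This is a minimal sketch, reusing X_, y, theta, and l_r from the example above:

n_epochs = 50

for epoch in range(n_epochs):
    # A fresh random ordering of all samples for this epoch
    for i in np.random.permutation(X_.shape[0]):
        x_sample = X_[i, :].reshape(1, -1)
        y_sample = y[i].reshape(1, -1)
        gradient = 2 * x_sample.T.dot(x_sample.dot(theta) - y_sample)
        theta -= l_r * gradient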

So, Batch Gradient Descent performs slowly on very large data but is stable at the function minimum, whereas Stochastic Gradient Descent performs well on large data but is unstable at the minimum. A compromise can be found where the gradient descent takes in randomly selected small batches from the training set to calculate the gradients and update the weights.

Mini-Batch Gradient Descent

Mini-batch gradient descent computes gradients over small, randomly selected subsets of the training set, offering a compromise between the computational efficiency of SGD and the more stable updates of BGD. It splits the training dataset into small random batches (e.g., 32 or 64 samples), calculates the gradients over each batch, and updates the parameters accordingly. This combination of speed and stability is why it is the variant most often used in modern machine learning.
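Since we have not written any code for this variant yet, here is a minimal sketch of mini-batch gradient descent on the same toy data (reusing X_, y, and l_r from the earlier examples; the batch size of 20 is an arbitrary choice):

batch_size = 20   # arbitrary; 32 or 64 are common choices
n_epochs = 100
m = X_.shape[0]   # 100 training samples
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    # Shuffle once per epoch, then walk through the data in small batches
    indices = np.random.permutation(m)
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X_[batch], y[batch]
        gradients = 2/batch_size * X_batch.T.dot(X_batch.dot(theta) - y_batch)
        theta -= l_r * gradients

Each update averages the gradient over 20 samples, which smooths out most of the sample-to-sample noise seen in SGD while still being far cheaper than a full pass over the data.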


The image below illustrates the comparison between the three gradient descent variants discussed:

Comparison of different gradient descent techniques

As discussed earlier, BGD takes a straight path to the function minimum. Stochastic gradient descent shows the most randomness and fails to settle at the minimum. Mini-batch gradient descent shows similar stochasticity, but with a much lower magnitude.

Summary

In this blog, we discussed three types of gradient descent algorithms: Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent. We started by revisiting the concept of gradient descent using an analogy of a ball rolling in a bowl, representing the process of minimizing the cost function.

In summary, if your dataset is relatively small, Batch Gradient Descent may be the best option for its stability. For very large datasets, Stochastic Gradient Descent allows faster updates but might struggle with convergence. Mini-Batch Gradient Descent is a go-to for deep learning and modern machine learning, where it balances speed and accuracy efficiently.

If you enjoyed this blog, do leave a follow on my social media as every single increase in my followers gives me the motivation to keep making such interesting blogs. 🙂
