Imagine training a deep network where, as you move backward through the layers, the learning signals just get tinier and tinier, almost disappearing into thin air. This makes it super tough for the earlier layers to learn anything useful. This is commonly known as the problem of *vanishing gradients.*

Vanishing gradients are a hallmark of an unstable training process, and they can leave a model producing nearly the same output for every input it is fed. The problem rarely shows up in shallow models but is common in deep neural networks. In this post, we’ll break down what the vanishing gradient problem is, and what can be done to tackle it. Let’s dive in!

## Table of Contents

## What is the Vanishing Gradient problem?

While building neural networks, every layer learns features of the input data, and the learning capacity of the model generally increases with depth. However, deep networks can run into a problem: after a certain number of iterations, the gradients flowing back to the earlier layers become vanishingly small, those layers stop updating, and training stalls. In extreme cases, the model produces nearly the same output, close to zero, for every input. This is known as the vanishing gradient problem.
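To make the "same output for every input" symptom concrete, here is a toy illustration (no weights or training, just stacked sigmoid activations, so it is a simplification of a real network): repeatedly applying a squashing function drives wildly different inputs to nearly identical outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def deep_sigmoid(x, depth=20):
    # Pass the value through `depth` stacked sigmoid activations.
    for _ in range(depth):
        x = sigmoid(x)
    return x

# Two inputs at opposite extremes end up at almost the same value,
# because iterating the sigmoid converges to its fixed point (~0.659).
print(deep_sigmoid(-100.0))  # ≈ 0.659
print(deep_sigmoid(+100.0))  # ≈ 0.659
```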

## What causes it?

The vanishing gradient problem is often caused by using a saturating activation function, such as sigmoid, throughout a deep network. Consider an example in which you are building a classification model whose output is 0 or 1. Since sigmoid is a natural fit for the output, you might set the activation function in *every* layer to sigmoid. There is a high chance this leads to the vanishing gradient problem. How? The output of each sigmoid layer lies between 0 and 1, no matter how large its input is, and the derivative of the sigmoid is at most 0.25. During backpropagation, the derivatives of successive layers are multiplied together by the chain rule. Hence, for *n* hidden layers the gradient is scaled by up to (0.25)^*n*, which quickly diminishes toward zero.
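A minimal sketch of this chain-rule product, in plain Python (the 20-layer depth is an arbitrary choice for illustration): even in the best case, where every derivative takes its maximum value of 0.25, the gradient shrinks geometrically with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, attained at x = 0

# The gradient reaching an early layer is (roughly) a product of one
# derivative term per layer above it.
grad = 1.0
for layer in range(20):              # 20 hidden layers, best-case input
    grad *= sigmoid_derivative(0.0)  # each factor is exactly 0.25

print(grad)  # ≈ 9.1e-13: effectively zero for the earliest layers
```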

## Solution to the problem

A common rule of thumb in this situation is to use the **ReLU activation function**. ReLU behaves like the linear (identity) activation for positive inputs, except that negative values are returned as zero. So for the binary classification example, one may set ReLU as the activation function in all hidden layers and use sigmoid only in the last layer.
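To see why this helps, here is the same kind of chain-rule product sketched in plain Python, this time with ReLU derivatives (again an illustrative 20-layer depth): for a unit with a positive pre-activation, the ReLU derivative is exactly 1, so the product does not decay with depth.

```python
def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (the derivative at 0 is
    # conventionally taken as 0 here).
    return 1.0 if x > 0 else 0.0

# Product of derivatives across 20 layers, assuming the units are
# active (positive pre-activation).
grad = 1.0
for layer in range(20):
    grad *= relu_derivative(0.5)

print(grad)  # 1.0: the gradient passes through unattenuated
```

The trade-off is that inactive units (negative pre-activation) pass no gradient at all, which is why variants such as Leaky ReLU exist.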

One other solution is to use a residual network (ResNet). A residual block adds its original input directly to its output through a skip connection, thereby retaining the useful information of the input data and giving gradients a direct path back through the network.
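A minimal sketch of a residual block, with hypothetical shapes and random weights purely for illustration: the block computes F(x) + x, so the input features are carried forward unchanged alongside the learned transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w):
    fx = np.maximum(0.0, x @ w)  # F(x): a single ReLU layer here
    return fx + x                # skip connection adds the input back

x = rng.normal(size=(1, 4))   # toy input
w = rng.normal(size=(4, 4))   # toy weights (shapes must match for the add)

y = residual_block(x, w)

# During backprop, d(F(x) + x)/dx contains an identity term, so the
# gradient has a direct path that cannot be squashed to zero by F.
```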

## Conclusion

The vanishing gradient problem is typically caused by stacking saturating activation functions, such as sigmoid, throughout a deep model. It can, most commonly, be tackled by using the ReLU activation function in the hidden layers, which keeps gradients flowing during backpropagation, or by adding residual connections that give gradients a direct path through the network.