
I found that the derivatives of the common activation functions are bounded in [0, 1]. https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html

This is a cause of vanishing gradients in RNNs.

Why were the derivatives kept in [0, 1] when activation functions were first introduced to deep learning? What would happen to an MLP if we used a variant of ReLU such as f(x) = max(0, 2x), whose derivative ranges over [0, 2]?
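For concreteness, here is a minimal NumPy sketch of the variant I have in mind (the name scaled_relu is just a placeholder, not a standard function):

```python
import numpy as np

def scaled_relu(x):
    """The variant from the question: f(x) = max(0, 2x)."""
    return np.maximum(0.0, 2.0 * x)

def scaled_relu_grad(x):
    """Its derivative is 0 for x < 0 and 2 for x > 0, so it lies in [0, 2]."""
    return np.where(x > 0, 2.0, 0.0)

x = np.array([-1.5, -0.1, 0.3, 2.0])
print(scaled_relu(x))       # [0.  0.  0.6 4. ]
print(scaled_relu_grad(x))  # [0. 0. 2. 2.]
```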

QuantCub

1 Answer


The opposite of the vanishing gradient is the exploding gradient, in which the gradient reaches very large values. Just as a vanishing gradient causes serious trouble during gradient descent, so does an exploding gradient, which produces extremely large steps during optimization.

This phenomenon is especially important in RNNs trained with backpropagation through time, since the per-timestep derivatives are effectively multiplied together during backpropagation. Thus, allowing the activation derivative to reach 2 instead of 1 means the backpropagated gradient can grow on the order of 2^n over n timesteps, greatly increasing the chance of an exploding gradient.
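As a rough illustration (a toy sketch, not code from the question or a library), assume unit recurrent weights and an always-active unit, so the chain-rule factor at each timestep is just the activation derivative:

```python
def backprop_gradient_norm(n_steps, act_grad, w=1.0, g0=1.0):
    """Multiply an initial gradient by (w * activation derivative) once per
    timestep, mimicking the repeated chain-rule factor in BPTT."""
    g = g0
    for _ in range(n_steps):
        g *= w * act_grad
    return g

for n in (5, 10, 20):
    # Derivative 1 (standard ReLU, active unit) vs derivative 2 (f(x) = max(0, 2x))
    print(n, backprop_gradient_norm(n, act_grad=1.0), backprop_gradient_norm(n, act_grad=2.0))
# With derivative 2 the gradient grows like 2^n: 32, 1024, 1048576.
```

With a derivative capped at 1 the product stays bounded, while a derivative of 2 compounds geometrically, which is exactly the exploding-gradient behavior described above.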

JimmyOnThePage