The derivative of the tanh(x) activation function is 1 - tanh^2(x).
When performing gradient descent on this function, this derivative becomes part of the gradients for the weights.
For example, with Mean Squared Error loss L = (1/2)(tanh(x) - y)^2:

dL/dw = (tanh(x) - y) * (1 - tanh^2(x)) * dx/dw
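To make this concrete, here is a minimal sketch that checks the analytic gradient against a finite-difference estimate. It assumes a single neuron with one weight, pre-activation x = w * input, and the MSE loss with the 1/2 factor as above; the variable names (w, x, y) are illustrative, not from any particular framework.

```python
import math

def loss(w, x, y):
    # MSE for a single tanh neuron: L = 0.5 * (tanh(w*x) - y)^2
    return 0.5 * (math.tanh(w * x) - y) ** 2

def grad(w, x, y):
    # Analytic gradient from the chain rule:
    # (tanh(wx) - y) * (1 - tanh^2(wx)) * dx/dw, where dx/dw = x here
    t = math.tanh(w * x)
    return (t - y) * (1 - t ** 2) * x

w, x, y = 0.3, 1.5, 1.0
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
print(grad(w, x, y), numeric)  # the two estimates agree closely
```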
When tanh(x) is equal to 1 or -1 (strictly, when it saturates arbitrarily close to those values), the term tanh^2(x) becomes 1. This means that if the correct class is predicted with full confidence, then 1 - tanh^2(x) equals 0, so the gradient of the loss becomes 0 and the weights do not update.
However, for the same reason, this would appear to mean that if the exact wrong class is predicted, i.e. tanh(x) is saturated at the opposite sign of the target, then the gradient is still 0, causing no update even though the error is maximal. Presumably, this is the opposite of what you want to happen.
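The scenario above can be demonstrated numerically. This sketch reuses the same single-neuron gradient formula (a hypothetical toy setup, not a specific library): when the neuron saturates at -1 while the target is +1, the gradient is vanishingly small, whereas an unsaturated neuron with the same target gets a large gradient.

```python
import math

def grad(w, x, y):
    # (tanh(wx) - y) * (1 - tanh^2(wx)) * x, as in the MSE derivation above
    t = math.tanh(w * x)
    return (t - y) * (1 - t ** 2) * x

# Saturated at the *wrong* class: tanh(-10) is about -1, target is +1.
# The error term (t - y) is near -2, but (1 - t^2) is near 0 and wins.
print(grad(-10.0, 1.0, 1.0))  # gradient is tiny despite maximal error

# Unsaturated neuron, same target: the gradient is large, as expected.
print(grad(0.0, 1.0, 1.0))
```

This is one face of the vanishing-gradient problem for saturating activations, which is the behavior the question is asking about.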
Is this a problem? If so, how is this problem avoided/amended?