The derivative of the tanh(x) activation function is 1 - tanh^2(x).
When performing gradient descent on this function, this derivative becomes part of the gradients for the weights.
For example, with Mean Squared Error loss L = (1/2)(tanh(x) - y)^2:

dL/dw = (tanh(x) - y) * (1 - tanh^2(x)) * dx/dw
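To make this concrete, here is a minimal sketch that checks the analytic gradient against a finite-difference estimate. It assumes a single neuron with one weight, pre-activation x = w * input, and the MSE loss with the 1/2 factor as above; the variable names (w, x, y) are illustrative, not from any particular framework.

```python
import math

def loss(w, x, y):
    # MSE for a single tanh neuron: L = 0.5 * (tanh(w*x) - y)^2
    return 0.5 * (math.tanh(w * x) - y) ** 2

def grad(w, x, y):
    # Analytic gradient from the chain rule:
    # (tanh(wx) - y) * (1 - tanh^2(wx)) * dx/dw, where dx/dw = x here
    t = math.tanh(w * x)
    return (t - y) * (1 - t ** 2) * x

w, x, y = 0.3, 1.5, 1.0
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
print(grad(w, x, y), numeric)  # the two estimates agree closely
```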
When tanh(x) is equal to 1 or -1 (strictly, when it saturates arbitrarily close to those values), the term tanh^2(x) becomes 1. This means that if the correct class is predicted with full confidence, then 1 - tanh^2(x) equals 0, so the gradient of the loss becomes 0 and the weights do not update.
However, for the same reason, this would appear to mean that if the exact wrong class is predicted, i.e. tanh(x) is saturated at the opposite sign of the target, then the gradient is still 0, causing no update even though the error is maximal. Presumably, this is the opposite of what you want to happen.
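The scenario above can be demonstrated numerically. This sketch reuses the same single-neuron gradient formula (a hypothetical toy setup, not a specific library): when the neuron saturates at -1 while the target is +1, the gradient is vanishingly small, whereas an unsaturated neuron with the same target gets a large gradient.

```python
import math

def grad(w, x, y):
    # (tanh(wx) - y) * (1 - tanh^2(wx)) * x, as in the MSE derivation above
    t = math.tanh(w * x)
    return (t - y) * (1 - t ** 2) * x

# Saturated at the *wrong* class: tanh(-10) is about -1, target is +1.
# The error term (t - y) is near -2, but (1 - t^2) is near 0 and wins.
print(grad(-10.0, 1.0, 1.0))  # gradient is tiny despite maximal error

# Unsaturated neuron, same target: the gradient is large, as expected.
print(grad(0.0, 1.0, 1.0))
```

This is one face of the vanishing-gradient problem for saturating activations, which is the behavior the question is asking about.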
Is this a problem? If so, how is this problem avoided/amended?