Apologies in advance if this question doesn't follow the conventional format, where a snippet of code or a question about specific code is involved. I'm just trying to understand certain specific points on the subject of neural networks. I was watching a YouTube video (by Siraj Raval - School of AI) about choosing the best activation function for a neural network: https://www.youtube.com/watch?v=-7scQpJT7uo
1- I tried to understand his explanation of why the sigmoid is no longer considered an ideal activation function for neural networks, for the following reasons:
- Sigmoids saturate and kill gradients.
- Sigmoids slow convergence.
- Sigmoids are not zero-centered.
- OK to use on last layer.
First of all, I'm guessing that the 1st and 2nd reasons are related, i.e. that the first reason leads to the second. Is that correct?
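To make the question concrete, here is how I currently picture it (my own sketch in NumPy, not from the video):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid; its maximum is 0.25 at x = 0

# For large |x| the neuron is "saturated" and the local gradient is close to 0
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))

# Backprop multiplies these local gradients layer by layer, so even in the
# best case the signal shrinks geometrically with depth:
print(0.25 ** 5)   # 5 layers -> at most ~0.001 of the original gradient
```

Is this shrinking of the gradient the mechanism that makes convergence slow, or is there more to the second reason than that?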
2- The 3rd reason (not zero-centered) I did not understand. At 5:52 in the video, Siraj explains it as: "... output starts at 0 and ends at 1, that means the value after the function will be positive and that makes the gradient of weights either all positive or all negative. This makes the gradient updates go too far in different directions ...". I didn't understand this point; ideally it would be helpful to see how it is explained mathematically.
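For what it's worth, the closest I can get to a mathematical version of his statement is the following (please correct me if the setup is wrong). For a neuron computing $a = f(z)$ with $z = \sum_i w_i x_i + b$, backprop gives

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a}\, f'(z)\, x_i .$$

If every input $x_i$ is the output of a sigmoid in the previous layer, then every $x_i > 0$, so all the $\partial L / \partial w_i$ of that neuron share the sign of $\frac{\partial L}{\partial a} f'(z)$, i.e. they are all positive or all negative in a given update step. Is that what he means by the gradient updates going "too far in different directions", or is it something else?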
3- He then goes on to say that the tanh function solves this. Again, I didn't understand why (mathematically).
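My guess for this point is the following (again my own attempt, not from the video): since

$$\tanh(x) = 2\,\sigma(2x) - 1 \in (-1, 1),$$

its outputs can be negative as well as positive, so the inputs feeding the next layer are no longer all positive and the all-same-sign argument above no longer applies. Is that the whole reason, or does tanh help in other ways too?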
4- a) Then at 7:20, he mentions that ReLU is best used for the hidden layers while a softmax function is used for the output layer, but he doesn't specifically reference which function. So would the sigmoid function be a good assumption here? b) He also adds that a linear function should be used for regression "... since the signal goes through unchanged ...". What does he mean by this sentence?
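To check my understanding of both 4a) and 4b), is the arrangement he's describing roughly the following? (A minimal sketch using the Keras API; the layer sizes and input shape are placeholders I made up.)

```python
import tensorflow as tf
from tensorflow.keras import layers

# Multi-class classification: ReLU in the hidden layers, softmax on the output
clf = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # 10 classes
])

# Binary classification: is a sigmoid output the right choice here (question 4a)?
binary_clf = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Regression: "linear" output, i.e. no activation, so the last layer's value
# is passed through unchanged and can be any real number (question 4b)
reg = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="linear"),
])
```

Is "the signal goes through unchanged" just another way of saying that the output layer applies no squashing function at all?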
5- Finally, he mentions the problem with ReLU where "... some units can be fragile during training and die, meaning a big gradient flowing through a neuron could cause a weight update that makes it never activate on any data point again. So then gradients flowing through it will always be zero from that point on ...". Again, I didn't understand that explanation, especially without seeing the mathematical side of it, which would make it easier to follow.
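Here is the toy picture I have in my head for the "dying ReLU" point; does it capture what he means? (Hypothetical numbers, NumPy only.)

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
relu_grad = lambda z: (z > 0).astype(float)   # derivative: 1 where z > 0, else 0

# Inputs to this neuron (e.g. outputs of a previous sigmoid/ReLU layer, so non-negative)
X = np.random.uniform(0.0, 1.0, size=(100, 3))

# Suppose a single huge gradient step pushed the weights and bias very negative
w = np.array([-50.0, -50.0, -50.0])
b = -100.0

z = X @ w + b              # pre-activation for every data point
print(z.max())             # still <= -100, i.e. negative for all 100 points
print(relu(z).max())       # the neuron outputs 0 everywhere
print(relu_grad(z).max())  # its local gradient is 0 everywhere...

# ...and since dL/dw_i = upstream_grad * relu_grad(z) * x_i, every future weight
# update for this neuron is 0, so it can never recover: it has "died".
```

Is this the scenario he is describing, and if so, is it simply the fact that the ReLU derivative is exactly 0 for negative inputs that makes the death permanent?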
I have a fair basic intuition of neural networks and the sigmoid function, but when digging into deeper explanations such as this video about different activation functions, I feel certain points were just mentioned casually, without the reasoning being backed up with some maths.
Any help would be really appreciated. Many thanks.