
Apologies in advance if this question doesn't follow the conventional approach of including a code snippet or asking about a specific piece of code. I'm just trying to understand certain specific points on the subject of neural networks. I was watching a YouTube video (by Siraj Raval - School of AI) about choosing the best activation function for a neural network: https://www.youtube.com/watch?v=-7scQpJT7uo

1- I tried to understand his explanation of why the Sigmoid is no longer an ideal activation function for neural networks, for the following reasons:

  • Sigmoids saturate and kill gradients.
  • Sigmoids slow convergence.
  • Sigmoids are not zero-centered.
  • OK to use on last layer.

First of all, I'm guessing the 1st and 2nd reasons are related, or that the first reason leads to the second. Is that correct?
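To make reason 1 concrete for myself, here is a small numpy sketch of the sigmoid and its derivative as I understand them (my own illustration, not from the video, so please correct me if it is wrong):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at x = 0) and shrinks towards 0 as |x| grows,
# which I assume is what "saturates and kills gradients" refers to.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid = {sigmoid(x):.5f}   gradient = {sigmoid_grad(x):.5f}")
```

If that is right, the gradient is never larger than 0.25 and becomes vanishingly small for large positive or negative inputs, which would also explain the slow convergence -- but I would like that confirmed.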

2- The 3rd reason I did not understand (not zero-centered). At 5:52 in the video, Siraj explains the reason: "... output starts at 0 and ends at 1, that means the value after the function will be positive and that makes the gradient of weights either all positive or all negative. This makes the gradient updates go too far in different directions ...". I didn't understand this point; ideally it would be helpful to see how it is explained mathematically.
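Trying to make that quote concrete, here is my own tiny numpy sketch of a single neuron whose inputs all come from a sigmoid layer (the numbers are made up purely for illustration, so please correct me if this is not what he means):

```python
import numpy as np

# One neuron: y = w . a, where a holds activations from a previous sigmoid layer.
a = np.array([0.2, 0.7, 0.9])   # sigmoid outputs, so always positive
w = np.array([0.5, -0.3, 0.1])

upstream = -1.3                 # dLoss/dy flowing back from later layers

# dLoss/dw_i = a_i * upstream: since every a_i > 0, every component has the
# same sign as `upstream`, i.e. the weight gradients are all positive or all negative.
grad_w = a * upstream
print(grad_w)                   # [-0.26 -0.91 -1.17] -- all the same sign
```

If that is correct, the weight vector can only ever be updated in directions where all components move the same way, but I still don't see how that amounts to the updates going "too far in different directions".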

3- He then goes on to say that the Tanh function solves this. Again I didn't understand why (mathematically).
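The only mathematical fact I have found so far is that tanh maps into (-1, 1) and relates to the sigmoid by tanh(x) = 2*sigmoid(2x) - 1, so its outputs can be negative as well as positive (i.e. they are zero-centred). A quick numpy check of this (again my own, not from the video):

```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

print(1.0 / (1.0 + np.exp(-x)))   # sigmoid: everything in (0, 1), strictly positive
print(np.tanh(x))                 # tanh: in (-1, 1), centred around 0

# Identity relating the two: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0))   # True
```

Whether that fully explains why tanh "solves" the problem is exactly what I'm asking.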

4- a) Then at 7:20, he mentions that ReLU is best used for the hidden layers while a SoftMax function is used for the output layer, but he doesn't specifically reference which function. So would the Sigmoid function be a good assumption here? b) He also adds that a linear function should be used for regression "... since the signal goes through unchanged ...". What does he mean by this sentence?

5- Finally, he mentions the problem with ReLU where "... some units can be fragile during training and die meaning a big gradient flowing through a neuron could cause a weight update that makes it never activate on any data point again. So then gradients flowing through it will always be zero from that point on ...". Again I didn't understand that explanation; without seeing the mathematical side of it, it doesn't quite make sense to me.
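Here is my attempt at a minimal numpy sketch of what a "dead" ReLU unit might look like (my own illustration; I'm simply assuming a weight update has pushed the pre-activation negative for every data point):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 where the input was positive, 0 otherwise.
    return (x > 0).astype(float)

# Suppose a large gradient update pushed this neuron's weights/bias so far that
# its pre-activation z = w.x + b is negative for every input in the dataset.
z_for_all_inputs = np.array([-4.2, -1.7, -6.0, -0.3])

print(relu(z_for_all_inputs))       # [0. 0. 0. 0.] -- the unit never activates
print(relu_grad(z_for_all_inputs))  # [0. 0. 0. 0.] -- so no gradient flows back,
                                    # and the weights can never recover
```

Is that what "dying" means here, and is there a more rigorous way to see it?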

I have a fairly basic intuition of neural networks and of the Sigmoid function, but when digging into deeper explanations such as this video about different activation functions, I feel certain points were just mentioned casually, without the reasoning being explained with some maths as well.

Any help would be really appreciated. Many thanks.

  • Way too broad for SO, which is *not* a tutorial service; complaining that Siraj's videos go "without seeing the mathematical side of it" and "without explaining the reasoning with some maths as well" is a bit unfair and rather a contradiction in terms: his videos do not pretend to cover such things (his target audience is clearly different), and there are literally dozens of free courses and tutorials out there that cover these things in whatever mathematical detail you can take... – desertnaut Feb 08 '19 at 23:48
  • Well I wasn't complaining about Siraj, and definitely not "complaining" in any context. I know Siraj could explain it mathematically if he wants. He's very gifted and I really appreciate all the knowledge he's spreading to others. I know the purpose of his video here is to simply cover over the subject briefly, because to explain it in depth would need a lot of videos or articles to cover the foundation. My purpose here is not "complaining", but simply from referencing this video I wanted to search for more explanation. In terms of "too broad for SO", fair enough I understand. – Hazzaldo Feb 09 '19 at 18:31
  • On another note, I appreciate your answer @desertnaut, but it's not particularly constructive to take a post out of context negatively, when my purpose is simply to find more explanation on a learning resource, NOT "complaining" about the resource or the author. Thanks. – Hazzaldo Feb 09 '19 at 18:50
  • You read too much in a simple comment, which was just meant to explain why the question has already 3 closing votes as "too broad". I'll admit that the term "complaining" was not the most appropriate I could come up with; I was (obviously unsuccessfully) trying to express that trying to learn from Siraj and then coming here to SO seeking clarifications and mathematical explanations on pretty much everything covered by Siraj (except of *coding* issues, about which SO is all about) is rather puzzling, and arguably not good practice - but that's just me, of course... – desertnaut Feb 09 '19 at 19:07
  • Well, I wouldn't say I was reading too much into something; I was merely pointing out that using a word like "complaining" is a bit strong and can be taken out of context. And I wasn't asking for "everything covered by Siraj", but I picked out very specific points/statements made in the video to search for more answers. But yes I understand the rest of what you're saying. I admit I'm pretty new to SO as a question writer, and this is a learning curve for me. Thank you, I'll look for answers elsewhere. Many thanks – Hazzaldo Feb 09 '19 at 19:24

1 Answer


Sigmoid: when working with neural networks you want this function because it keeps the non-linearity; use it on the output layer.

ReLU: when training, use this in the hidden layers; it keeps values where x > 0 and outputs zero otherwise. I suggest looking at ReLU; softmax is also used, but in practice you get better results with ReLU in the hidden layers.
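As a rough sketch of the above (using Keras purely for illustration -- the layer sizes and input shape here are arbitrary), a classifier would put ReLU in the hidden layers and softmax on the output layer, while a regression model would keep the same hidden layers but use a linear output so the signal goes through unchanged:

```python
import tensorflow as tf

# Classification: ReLU in the hidden layers, softmax on the output layer.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # e.g. 10-class output
])

# Regression: same hidden layers, but a linear (identity) output layer,
# so the final signal passes through unchanged.
regressor = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),                          # linear activation by default
])
```

For a single yes/no output you could swap the softmax layer for a one-unit sigmoid layer instead.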

  • Hi Marco, thanks for the answer. I understand ReLU should be used in the hidden layers and Sigmoid is best used in the output layer, but as per my post I was after answers to the 5 points I mentioned i.e. the mathematical reasoning behind them. I'm afraid this doesn't answer the 5 points I listed, but thanks appreciate the answer in any case. – Hazzaldo Feb 09 '19 at 18:58