
I have just started programming neural networks. I am currently working on understanding how a backpropagation (BP) neural net works. While the training algorithm for BP nets is quite straightforward, I was unable to find any text on why the algorithm works. More specifically, I am looking for some mathematical reasoning to justify using sigmoid functions in neural nets, and for what makes them able to mimic almost any data distribution thrown at them.

Thanks!

Anshul Porwal

1 Answer


The sigmoid function introduces non-linearity into the network. Without a non-linear activation function, the net can only learn functions which are linear combinations of its inputs. The result that a feedforward net with a single hidden layer of sigmoidal units can approximate essentially any continuous function is called the universal approximation theorem, or Cybenko's theorem, after the gentleman who proved it in 1989. Wikipedia is a good place to start, and it has a link to the original paper (the proof is somewhat involved, though). The reason why you would use a sigmoid rather than something else is that it is continuous and differentiable, its derivative is very fast to compute (as opposed to the derivative of tanh, which has similar properties), and it has a limited range (from 0 to 1, exclusive).
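To make both points concrete, here is a short NumPy sketch (the weight shapes and values are arbitrary, purely for illustration): it checks that the sigmoid's derivative can be computed straight from the sigmoid's own output, which is what makes the backpropagation update cheap, and that a hidden layer with a linear (identity) activation collapses into a single matrix, so it adds no expressive power.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(s):
    """Derivative written in terms of the sigmoid's *output* s = sigmoid(x):
    d/dx sigmoid(x) = s * (1 - s), so backprop can reuse the forward pass."""
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
s = sigmoid(x)

# Check the closed-form derivative against a finite-difference approximation.
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(sigmoid_derivative(s), numeric))  # True

# Without a non-linear activation, two layers are equivalent to one:
# W2 @ (W1 @ inp) equals (W2 @ W1) @ inp for every input, so the hidden
# layer adds nothing.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # "hidden layer" weights
W2 = rng.normal(size=(2, 4))   # "output layer" weights
inp = rng.normal(size=3)

two_layer = W2 @ (W1 @ inp)    # linear hidden layer, then output layer
one_layer = (W2 @ W1) @ inp    # a single layer with the combined weights
print(np.allclose(two_layer, one_layer))  # True
```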

mbatchkarov
  • Nice answer, but the assumption "continuous (and thus differentiable)" does not stand. Example: abs(x), which is continuous at zero but not differentiable. – Michael Sep 24 '14 at 00:42
  • The Wikipedia article says this, though: *"Kurt Hornik showed in 1991 that it is not the specific choice of the activation function, but rather the multilayer feedforward architecture itself which gives neural networks the potential of being universal approximators. The output units are always assumed to be linear."* In fact it doesn't seem to say anything about requiring a non-linear activation function. But the formal statement of the theorem does say "nonconstant, bounded, and monotonically-increasing continuous function" -- perhaps the *bounded* and monotone part implies nonlinearity? – Desty Nov 04 '14 at 15:05
  • @Desty, a linear activation function turns the whole network into a linear classifier (a linear combination of linear functions is still linear), which makes hidden units useless. – Artem Sobolev Jan 09 '15 at 22:19
  • It is rather interesting, though, that the field of Deep Learning has turned to Rectifier Units, which are essentially linear functions. – chutsu Oct 24 '15 at 19:20
  • "Without it, the net can only learn functions which are linear combinations of its inputs." What does the "it" mean? 'the sigmoid function','non-linearity' or just 'activation function'? – squid Feb 23 '16 at 03:24