
First off: I understand derivatives and the chain rule. I'm not great with math, but I have an understanding.

Numerous tutorials on backpropagation with gradient descent (let's use this and this) state that we use the derivative of the transfer function (sigmoid or tanh) to calculate the gradient and hence which way to travel next. In these tutorials I see (t-o)(1-o)(o) as the formula for calculating the error for the output neurons, which seems to be the derivative of the error calculation (1/2)(t-o)^2 multiplied by (1-o)(o). (t = target, o = output_actual)

Why do I not see the derivative of the transfer function (assuming sigmoid), e^x/((e^x + 1)^2), anywhere? Or, when tanh is used as the transfer function, sech^2(x), where x = the weighted input?

Also, some tutorials use (target - actual), (target - actual)^2 [sum of squares, useful for negative outputs], or the squared error function (1/2)(target - actual)^2.

Where is the derivative of the transfer function and which is the correct error formula to use?

SilverFox
  • Have you looked at the 'Quasi-Newton' method? – Seb May 27 '14 at 01:13
  • No, but I will. I am more concerned with understanding the most common (and easiest to find examples of) method of error propagation before I begin modifying it further. – SilverFox May 27 '14 at 01:16
  • Sorry, I thought this could be an explanation for why you can't find the derivative of the TF: quasi-Newton methods don't need it – Seb May 27 '14 at 12:32

2 Answers


Why do I not see the derivative of the transfer function (assuming sigmoid): e^x/((e^x + 1)^2) anywhere?

You do; in the wiki page you link, the sigmoid is written as 1/(1+e^-x) and its derivative in an equivalent product form. If we expand the latter we get

(1/(1+e^-x))*(1-1/(1+e^-x)) = e^x/(e^x+1)^2

which is the original form you noted.
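
A quick numeric check makes the equivalence concrete (a minimal Python sketch, not part of the original answer):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
        product_form = sigmoid(x) * (1.0 - sigmoid(x))           # o * (1 - o)
        explicit_form = math.exp(x) / (math.exp(x) + 1.0) ** 2   # e^x / (e^x + 1)^2
        assert abs(product_form - explicit_form) < 1e-12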

Or for when tanh is used as the transfer function: sech^2(x) ... where x = weighted input?

Well, in this case it's because the page doesn't mention tanh as a potential activation function. But in real implementations it is expressed in a similar way, as tanh'(x) = 1 - tanh^2(x), so that we can reuse the already-computed activation and avoid any unnecessary computation.
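
For instance (a small Python sketch with an arbitrary input, not from the original answer), the backward pass can reuse the activation from the forward pass:

    import math

    x = 0.7                # arbitrary weighted input to the neuron
    o = math.tanh(x)       # activation, already computed in the forward pass
    deriv = 1.0 - o * o    # tanh'(x) = 1 - tanh^2(x); no extra evaluation needed
    assert abs(deriv - 1.0 / math.cosh(x) ** 2) < 1e-12  # agrees with sech^2(x)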

(target - actual)^2 [sum of squares, useful for negative outputs] or the squared error function (1/2)(target - actual)^2.

The difference is only a constant factor. The math comes out a little nicer if you keep the division by 2, since the 2 produced by differentiating the square cancels it. In practice the only thing that would change is that your learning rate gets implicitly multiplied or divided by 2, depending on which perspective you take.
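
To see the implicit rescaling (a sketch with made-up target, output, and learning-rate values):

    # d/da [(1/2)(t - a)^2] = -(t - a)    vs    d/da [(t - a)^2] = -2(t - a)
    t, a, lr = 1.0, 0.3, 0.1   # hypothetical target, output, learning rate

    step_half_sse = lr * (t - a)                  # update under (1/2)(t - a)^2
    step_plain_sse = (lr / 2.0) * 2.0 * (t - a)   # (t - a)^2 with a halved learning rate
    assert abs(step_half_sse - step_plain_sse) < 1e-12  # identical updates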

Also, some tutorials use (target - actual)

You probably misread. (t-a) is, up to sign, the derivative of (t-a)^2/2 with respect to a. Using just (t-a) as the error would give a constant derivative of -1, so the gradient would say nothing about how far off the output is, which I'm fairly sure would hinder learning for a NN.
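
Putting it together for an output neuron (a minimal sketch, assuming a sigmoid activation, the (1/2)(t - o)^2 error, and a hypothetical weighted input): the chain rule produces exactly the (t-o)(1-o)(o) term from the question.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    net = 0.4           # hypothetical weighted input to the output neuron
    t = 1.0             # target
    o = sigmoid(net)    # actual output

    # dE/dnet = dE/do * do/dnet = -(t - o) * o * (1 - o);
    # tutorials fold the minus sign into the weight-update rule and quote
    delta = (t - o) * o * (1.0 - o)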

Raff.Edward

Finding the derivative of a function is a standard calculus exercise.

You may also use Wolfram|Alpha online to do so, here: http://www.wolframalpha.com/

The code to enter is

D[ 1/(1+e^(-x)), x ]

You can enter any function using Mathematica notation: http://integrals.wolfram.com/about/input/.

With the derivative, you can plug it into the general formula for the error gradient. When the derivative is too complex, you can try the function Simplify[...] to find a better analytic form.
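
If you would rather stay in code, SymPy does the same symbolic work (a Python sketch, offered as an alternative to the Wolfram route above):

    import sympy as sp

    x = sp.symbols('x')
    logistic = 1 / (1 + sp.exp(-x))
    derivative = sp.diff(logistic, x)   # raw symbolic derivative
    print(sp.simplify(derivative))      # a compact equivalent form of the derivative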

As for choosing which transfer function to use, you can consider their domains and ranges. The logistic function 1/(1 + exp(-x)) has range (0, 1), but the atan(x) function has range (-pi/2, pi/2). If you perform mathematical analysis on the learning algorithms, the choice of the transfer function may matter a lot. However, if you are running simulations, the choice should not be critical, as long as the function has the S-shape (sigmoidal).
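
Evaluating each function far from the origin shows the different saturation ranges (a quick sketch; note that atan saturates at ±pi/2, not ±1):

    import math

    big = 50.0
    print(1 / (1 + math.exp(big)), 1 / (1 + math.exp(-big)))  # logistic -> (0, 1)
    print(math.tanh(-big), math.tanh(big))                    # tanh     -> (-1, 1)
    print(math.atan(-big), math.atan(big))                    # atan     -> (-pi/2, pi/2)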

Another thing to point out: the logistic function 1/(1 + exp(-x)) is only one instance of a sigmoidal function. atan(x) is also sigmoidal.

Danke Xie
  • You clearly did not read my question. e^x/((e^x + 1)^2) is the derivative of the sigmoid function. – SilverFox May 27 '14 at 01:03
  • OK. It's corrected. The formula for the logistic function is 1/(1+e^(-x)). Please distinguish between the logistic function and sigmoidal functions, which are often confused. atan is also sigmoidal. – Danke Xie May 27 '14 at 01:18