First off: I understand derivatives and the chain rule. I'm not great at math, but I do have a working understanding.
Numerous tutorials on backpropagation using gradient descent (let's use this and this) state that we use the derivative of the transfer function (sigmoid or tanh) to calculate the gradient and hence which way to step next. In these tutorials I see (t - o)(1 - o)(o) given as the formula for calculating the error for the output neurons, which seems to be the derivative of the error calculation (1/2)(t - o)^2 * (1 - o). (t = target, o = output_actual)
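To be concrete, here is a rough Python sketch of what I understand those tutorials to be computing for a single output neuron (the variable names and sample numbers are mine, just for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# What the tutorials seem to compute for one output neuron:
# delta = (t - o) * (1 - o) * o
t = 1.0             # target
x = 0.3             # weighted input to the output neuron (made-up value)
o = sigmoid(x)      # actual output
delta = (t - o) * (1 - o) * o
print(delta)
```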
Why do I not see the derivative of the transfer function (assuming sigmoid), e^x / (e^x + 1)^2, anywhere? Or, when tanh is used as the transfer function, sech^2(x), where x = weighted input?
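For reference, this is what I mean by the derivatives of the transfer functions, written out directly from their closed forms (again just a sketch with a made-up input value):

```python
import math

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = e^x / (e^x + 1)^2
    return math.exp(x) / (math.exp(x) + 1) ** 2

def tanh_derivative(x):
    # d/dx tanh(x) = sech^2(x) = 1 / cosh^2(x)
    return 1.0 / math.cosh(x) ** 2

x = 0.3  # weighted input (made-up value)
print(sigmoid_derivative(x), tanh_derivative(x))
```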
Also, some tutorials use (target - actual), others use (target - actual)^2 [sum of squares, useful when outputs can be negative], and others use the squared error function (1/2)(target - actual)^2.
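These are the three error formulas I keep running into, side by side, just to show exactly what I'm comparing (the sample values are mine):

```python
def error_difference(target, actual):
    return target - actual

def error_sum_of_squares(target, actual):
    return (target - actual) ** 2

def error_half_squared(target, actual):
    return 0.5 * (target - actual) ** 2

# Same (target, actual) pair fed to each formula
print(error_difference(1.0, 0.73),
      error_sum_of_squares(1.0, 0.73),
      error_half_squared(1.0, 0.73))
```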
Where is the derivative of the transfer function, and which is the correct error formula to use?