
Here is the code:

import numpy as np

# sigmoid function
def nonlin(x, deriv=False):
    # with deriv=True, x is assumed to already be a sigmoid output,
    # so x*(1-x) is the sigmoid's derivative evaluated at that point
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

# input dataset
X = np.array([  [0,0,1],
                [0,1,1],
                [1,0,1],
                [1,1,1] ])

# output dataset            
y = np.array([[0,0,1,1]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1

for iter in range(10000):

    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))

    # how much did we miss?
    l1_error = y - l1

    # multiply how much we missed by the 
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1,True)

    # update weights
    syn0 += np.dot(l0.T,l1_delta)

print "Output After Training:"
print l1

Here is the website: http://iamtrask.github.io/2015/07/12/basic-python-network/

In the line l1_delta = l1_error * nonlin(l1, True), the l1 error is multiplied by the derivative of the sigmoid of the input dotted with the weights. I have no idea why this is done and have spent hours trying to figure it out. I have just reached the conclusion that this is wrong, but something tells me that's probably not right, considering how many people recommend and use this tutorial as a starting point for learning neural networks.

In the article, they say that

Look at the sigmoid picture again! If the slope was really shallow (close to 0), then the network either had a very high value, or a very low value. This means that the network was quite confident one way or the other. However, if the network guessed something close to (x=0, y=0.5) then it isn't very confident.

I cannot seem to wrap my head around why the highness or lowness of the input into the sigmoid function has anything to do with confidence. Surely it doesn't matter how high it is, because if the predicted output is low, then the network will be really UNconfident, unlike what they said about it being confident just because the value is high.

Surely it would just be better to cube the l1_error if you wanted to emphasize the error?

This is a real letdown, considering that up to that point it finally looked like I had found a promising way to start learning about neural networks really intuitively, but yet again I was wrong. If you know of a good place to start learning where I can understand things easily, it would be appreciated.

Meme Stream
  • A NN can be equally confident in a low value; a low result doesn't mean low confidence, it just means it is confident that it won't fire for this input. Why do you assume confidence is only related to firing outcomes? – AChampion Aug 20 '17 at 22:28
  • Why does a low result mean it is confident that it won't fire for this input? What's the difference between that and 'confidence'? – Meme Stream Aug 20 '17 at 22:30
  • The derivative of the sigmoid indicates its confidence; it can descend or ascend to confidence based on the learning. – AChampion Aug 20 '17 at 22:32
  • That's what the article was saying... but my question is how? In this case, the sigmoid is only being used to squash the values to between 0 and 1. How does it indicate confidence when the only thing that should indicate confidence is the error from the predicted output? – Meme Stream Aug 20 '17 at 22:35
  • From my point of view this is an SGD/GD issue. You are trying to reach the minimum of the error function with your NN; to do that, you compute the gradient of the error function with respect to all the weights and biases. If you differentiate the error function with respect to the weights, part of the result will be the derivative of the sigmoid (because of the chain rule, and because the sigmoid is your activation function). I recommend you read about the stochastic gradient descent algorithm and take a look at the backpropagation derivation. One additional point: if you use a cross-entropy error function, you will avoid the sigmoid derivative, as sketched below. – viceriel Aug 21 '17 at 11:32
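
To illustrate the cross-entropy point in the last comment above (a minimal sketch of my own, not from the original post): with a binary cross-entropy loss, the sigmoid derivative cancels out of the gradient with respect to the pre-activation, leaving simply sigmoid(z) - y. A quick numerical check:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(z, y):
    # binary cross-entropy loss for a single example
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.7, 1.0                       # arbitrary pre-activation and target, for illustration only
analytic = sigmoid(z) - y             # the sigmoid' factor has cancelled out

eps = 1e-6                            # central finite-difference check of dL/dz
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)

print(analytic, numeric)              # both approx -0.3318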

2 Answers


Look at this image. If the sigmoid function gives you a HIGH or LOW value (pretty good confidence), the derivative at that value is LOW. If you get a value on the steepest part of the slope (around 0.5), the derivative at that value is HIGH.

When the function gives us a bad prediction, we want to change our weights by a larger amount; on the contrary, if the prediction is good (high confidence), we do NOT want to change our weights much.

[Image: sigmoid function and its derivative]
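
A quick numerical illustration of that point (my own sketch, reusing the nonlin helper from the question): the slope of the sigmoid is tiny when the output is near 0 or 1 and largest when the output is 0.5, so confident predictions produce small weight updates and uncertain ones produce large updates.

import numpy as np

def nonlin(x, deriv=False):
    # same helper as in the question: with deriv=True, x is a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

for out in [0.01, 0.5, 0.99]:
    print(out, nonlin(out, deriv=True))
# 0.01 -> 0.0099  (confident "low" prediction: tiny slope, tiny update)
# 0.5  -> 0.25    (unsure prediction: largest slope, biggest update)
# 0.99 -> 0.0099  (confident "high" prediction: tiny slope, tiny update)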

Truffle

First of all, this line is correct:

l1_delta = l1_error * nonlin(l1, True)

The total error from the next layer, l1_error, is multiplied by the derivative of the current layer (here I consider the sigmoid a separate layer to simplify the backpropagation flow). This is the chain rule.
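
If it helps, here is a small sketch (mine, not part of the tutorial) that checks this numerically. Assuming a squared-error loss 0.5 * sum((y - l1)**2), which is consistent with l1_error = y - l1, the chain-rule delta used in the update matches a finite-difference gradient of the loss with respect to syn0 (up to sign, since the tutorial adds it rather than subtracting it):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(1)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)
y = np.array([[0,0,1,1]], dtype=float).T
syn0 = 2 * np.random.random((3, 1)) - 1

def loss(w):
    # squared-error loss of the one-layer network
    return 0.5 * np.sum((y - sigmoid(X.dot(w))) ** 2)

# analytic gradient via the chain rule: dL/dsyn0 = -X.T @ ((y - l1) * l1 * (1 - l1))
l1 = sigmoid(X.dot(syn0))
l1_delta = (y - l1) * l1 * (1 - l1)
analytic = -X.T.dot(l1_delta)

# numerical gradient by central finite differences
eps = 1e-6
numeric = np.zeros_like(syn0)
for i in range(syn0.shape[0]):
    w_plus, w_minus = syn0.copy(), syn0.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True: the chain-rule delta is the (negative) gradient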

The quote about "network confidence" may indeed be confusing for a novice learner. What they mean here is the probabilistic interpretation of the sigmoid function. Sigmoid (or, more generally, softmax) is very often the last layer in classification problems: the sigmoid outputs a value in [0, 1], which can be seen as a probability or confidence of class 0 or class 1.

In this interpretation, sigmoid=0.001 is high confidence of class 0, which corresponds to a small gradient and a small update to the network; sigmoid=0.999 is high confidence of class 1; and sigmoid=0.499 is low confidence in either class.

Note that in your example, sigmoid is the last layer, so you can look at this network as doing binary classification, hence the interpretation above makes sense.
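
Concretely, you can read the final sigmoid outputs as P(class 1), threshold at 0.5 to get the predicted class, and treat the probability of the predicted class as the confidence. A small sketch (the l1 values below are hypothetical, just to illustrate the reading):

import numpy as np

# hypothetical final outputs l1 of the trained network, one row per training example
l1 = np.array([[0.03], [0.02], [0.97], [0.98]])

predicted_class = (l1 > 0.5).astype(int)                  # 1 if P(class 1) > 0.5, else 0
confidence = np.where(predicted_class == 1, l1, 1 - l1)   # probability of the predicted class

print(predicted_class.ravel())   # [0 0 1 1]
print(confidence.ravel())        # [0.97 0.98 0.97 0.98]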

If you consider a sigmoid activation in the hidden layers, the confidence interpretation is more questionable (though one can ask how confident a particular neuron is). But the error propagation formula still holds, because the chain rule holds.

Surely it would just be better to cube the l1_error if you wanted to emphasise the error?

Here's an important note. The big success of neural networks over the last several years is, at least partially, due to the use of ReLU instead of sigmoid in the hidden layers, precisely because it's better not to saturate the gradient (a saturated gradient is what leads to the vanishing gradient problem). So, on the contrary, you generally don't want to emphasise the error in backprop.
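
As a back-of-the-envelope illustration of why (a sketch of my own, ignoring the weights themselves): the sigmoid's derivative never exceeds 0.25, so backpropagating through many sigmoid layers scales the gradient by at most 0.25 per layer, whereas ReLU's derivative is 1 wherever the unit is active.

# sigmoid'(z) = s*(1-s) is maximised at s = 0.5, where it equals 0.25
max_sigmoid_grad = 0.25

for depth in [1, 5, 10, 20]:
    sigmoid_factor = max_sigmoid_grad ** depth   # best-case gradient scaling after `depth` sigmoid layers
    relu_factor = 1.0 ** depth                   # ReLU passes the gradient through unchanged when active
    print(depth, sigmoid_factor, relu_factor)
# by depth 10 the sigmoid factor is already below 1e-6 -- the vanishing gradient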

Maxim