
When using the chain rule to calculate the slope of the cost function with respect to the weights at layer L, the formula becomes:

d C0 / d W(L) = ... * d a(L) / d z(L) * ...

With:

z(L) being the induced local field: z(L) = w1(L) * a1(L-1) + w2(L) * a2(L-1) + ...

a(L) being the output: a(L) = &(z(L))

& being the sigmoid function used as an activation function

Note that L is taken as a layer indicator and not as an index

Now:
d a(L) / d z(L) = &'(z(L))

With &' being the derivative of the sigmoid function
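
Written out in full (this is just the expanded form of the formula above, with σ standing for the activation written here as &, and no assumptions beyond the definitions already given):

\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial C_0}{\partial a^{(L)}}, \qquad \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'\left(z^{(L)}\right) = \sigma\left(z^{(L)}\right)\left(1 - \sigma\left(z^{(L)}\right)\right)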

The problem:

But in this post, written by James Loy, on building a simple neural network from scratch with Python: when doing the backpropagation, he didn't give z(L) as the input to &' to stand for d a(L) / d z(L) in the chain rule. Instead he gave the output, i.e. the last activation of layer L, as the input to the sigmoid derivative &':

def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

def backprop(self):
        # application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))

Note that in the code above, layer L is layer 2, which is the last (output) layer. sigmoid_derivative(self.output) is where the activation of the current layer is given as input to the derivative of the sigmoid function used as the activation function.
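
To make the correspondence explicit, here is how the notation above maps onto the tutorial's variables (a restatement of the code shown, not anything new):

# Mapping the notation onto the code above, for layer L = 2 (the output layer):
# z(2) = np.dot(self.layer1, self.weights2)    # the induced local field
# a(2) = self.output = sigmoid(z(2))           # the activation of the output layer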

The question:

Shouldn't we use this sigmoid_derivative(np.dot(self.layer1, self.weights2)) instead of this sigmoid_derivative(self.output)?


3 Answers


It turns out that &(z(L)), i.e. the output, was used simply to accommodate the way sigmoid_derivative was implemented.

Here is the code of sigmoid and sigmoid_derivative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to already be a sigmoid output, i.e. x = sigmoid(z)
    return x * (1.0 - x)

The mathematical formula for the sigmoid derivative can be written as: &'(x) = &(x) * (1 - &(x))

So to get the formula above, &(z) and not z is passed to sigmoid_derivative, so that it returns &(z) * (1.0 - &(z)).
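
A quick numerical check of this (a sketch with an illustrative array z, not taken from the post): passing the sigmoid output into this sigmoid_derivative reproduces &(z) * (1 - &(z)), which matches a finite-difference estimate of d a / d z.

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to be a sigmoid output, i.e. x = sigmoid(z)
    return x * (1.0 - x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example pre-activations
a = sigmoid(z)                               # the activation, as in self.output

analytic = sigmoid_derivative(a)             # sigmoid(z) * (1 - sigmoid(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(analytic, numeric))        # True: matches d sigmoid(z) / d z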

– EEAH

You want to use the derivative with respect to the output. During backpropagation we use the weights only to determine how much of the error belongs to each one of the weights, and by doing so we can propagate the error further back through the layers.

In the tutorial, the sigmoid is applied to the last layer:

self.output = sigmoid(np.dot(self.layer1, self.weights2))

From your question:

Shouldn't we use this sigmoid_derivative(np.dot(self.layer1, self.weights2)) instead of this sigmoid_derivative(self.output)?

You cannot do:

sigmoid_derivative(np.dot(self.layer1, self.weights2))

because here you are trying to take the derivative of the sigmoid when you have not yet applied it.

This is why you have to use:

sigmoid_derivative(self.output)
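
To make this concrete, here is a small sketch (with an example value, not taken from the tutorial): given that sigmoid_derivative is implemented as x * (1.0 - x), feeding it the raw dot product computes z * (1 - z), which is not the sigmoid's derivative, while feeding it the activated output gives sigmoid(z) * (1 - sigmoid(z)).

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)

z = np.array([2.0])                 # stand-in for np.dot(self.layer1, self.weights2)
output = sigmoid(z)                 # stand-in for self.output

print(sigmoid_derivative(z))        # z * (1 - z) = [-2.]  -> not the sigmoid's gradient
print(sigmoid_derivative(output))   # sigmoid(z) * (1 - sigmoid(z)) ~= [0.105] -> the actual gradient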
– bhristov
  • But why break the formula and compute `&'(&(z))` instead of `&'(z)`, which will break the ratio between the components of the chain rule: `d C / d W(L) = d z(L) / d w(L) * d a(L) / d z(L) * d C / d a(L)`? – EEAH Jun 21 '20 at 22:52
  • The main purpose of the chain rule is to give the effect each variable has on the variation of the cost `C`, and `&(z(L))` itself has an effect on that cost. Why not take `&'(z(L))` and respect the formula? Why take `&'(&(z(L)))`? – EEAH Jun 21 '20 at 22:54
  • Also, at any stage `z(L)` depends on the weights and the previous activations. – EEAH Jun 21 '20 at 22:56
  • @EEAH I updated my answer. I hope that it is clearer now. – bhristov Jun 21 '20 at 23:42

You're right: it looks like the author made a mistake. I'll explain: when the network is done with a forward pass (all activations + loss), you use gradient descent to update the weights so as to minimize the loss function. To do this, you need the partial derivative of the loss function with respect to each weight matrix.

Some notation before I continue: loss is L, A is activation (aka sigmoid), Z means the net input, in other words, the result of W . X. Numbers are indices, so A1 means the activation for the first layer.

You can use the chain rule to move backwards through the network and express the derivative of the loss with respect to the weights. To begin the backward pass, you start by getting the derivative of the loss with respect to the last layer's activation. This is dL/dA2, because the second layer is the final layer. To update the weights of the second layer, we need to compute dA2/dZ2 and dZ2/dW2.

Before continuing, remember that the second layer's activation is A2 = sigmoid(W2 . A1) and Z2 = W2 . A1. For clarity, we'll write A2 = sigmoid(Z2). Treat Z2 as its own variable. So if you compute dA2/dZ2, you get sigmoid_derivative(Z2), which is sigmoid_derivative(W2 . A1) or sigmoid_derivative(np.dot(self.layer1, self.weights2)). So it shouldn't be sigmoid_derivative(self.output) because output was activated by sigmoid.
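
If you want to pass the pre-activation Z2 directly, the derivative helper has to apply the sigmoid itself. Here is a minimal sketch of such a variant (the name sigmoid_derivative_from_z is hypothetical, not from the tutorial):

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

# Hypothetical variant that accepts the pre-activation Z2 = np.dot(self.layer1, self.weights2)
# and applies the sigmoid internally before forming sigma * (1 - sigma).
def sigmoid_derivative_from_z(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Usage in backprop would then look like:
#   sigmoid_derivative_from_z(np.dot(self.layer1, self.weights2))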

– Shan S