
Here I'm attempting to implement a neural network with a single hidden layer to classify two training examples. This network utilizes the sigmoid activation function.

The layer dimensions and weights are as follows:

X: 2x4
w1: 2x3
l1: 4x3
w2: 2x4
Y: 2x3

I'm experiencing an issue in backpropagation where the matrix dimensions are not correct. This code:

import numpy as np

M = 2
learning_rate = 0.0001

X_train = np.asarray([[1,1,1,1] , [0,0,0,0]])
Y_train = np.asarray([[1,1,1] , [0,0,0]])

X_trainT = X_train.T
Y_trainT = Y_train.T

A2_sig = 0;
A1_sig = 0;

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))  
    return s

def forwardProp() : 

    global A2_sig, A1_sig;

    w1=np.random.uniform(low=-1, high=1, size=(2, 2))
    b1=np.random.uniform(low=1, high=1, size=(2, 1))
    w1 = np.concatenate((w1 , b1) , axis=1)
    A1_dot = np.dot(X_trainT , w1)
    A1_sig = sigmoid(A1_dot).T

    w2=np.random.uniform(low=-1, high=1, size=(4, 1))
    b2=np.random.uniform(low=1, high=1, size=(4, 1))
    w2 = np.concatenate((w2 , b2) , axis=1)
    A2_dot = np.dot(A1_sig, w2)
    A2_sig = sigmoid(A2_dot)

def backProp() : 

    global A2_sig;
    global A1_sig;

    error1 = np.dot((A2_sig - Y_trainT).T, A1_sig / M)
    print(A1_sig)
    print(error1)
    error2 = A1_sig.T - error1

forwardProp()
backProp()

Returns error:

ValueError                                Traceback (most recent call last)
<ipython-input-605-5aa61e60051c> in <module>()
     45 
     46 forwardProp()
---> 47 backProp()
     48 
     49 # dw2 = np.dot((Y_trainT - A2_sig))

<ipython-input-605-5aa61e60051c> in backProp()
     42     print(A1_sig)
     43     print(error1)
---> 44     error2 = A1_sig.T - error1
     45 
     46 forwardProp()

ValueError: operands could not be broadcast together with shapes (4,3) (2,4) 

How do I compute the error for the previous layer?

Update:

import numpy as np

M = 2
learning_rate = 0.0001

X_train = np.asarray([[1,1,1,1] , [0,0,0,0]])
Y_train = np.asarray([[1,1,1] , [0,0,0]])

X_trainT = X_train.T
Y_trainT = Y_train.T

A2_sig = 0;
A1_sig = 0;

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))  
    return s


A1_sig = 0;
A2_sig = 0;

def forwardProp() : 

    global A2_sig, A1_sig;

    w1=np.random.uniform(low=-1, high=1, size=(4, 2))
    b1=np.random.uniform(low=1, high=1, size=(2, 1))
    A1_dot = np.dot(X_train , w1) + b1
    A1_sig = sigmoid(A1_dot).T

    w2=np.random.uniform(low=-1, high=1, size=(2, 3))
    b2=np.random.uniform(low=1, high=1, size=(2, 1))
    A2_dot = np.dot(A1_dot , w2) + b2
    A2_sig = sigmoid(A2_dot)

    return(A2_sig)

def backProp() : 
    global A2_sig;
    global A1_sig;

    error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
    error2 = error1.T - A1_sig

    return(error1)

print(forwardProp())
print(backProp())

Returns error:

ValueError                                Traceback (most recent call last)
<ipython-input-664-25e99255981f> in <module>()
     47 
     48 print(forwardProp())
---> 49 print(backProp())

<ipython-input-664-25e99255981f> in backProp()
     42 
     43     error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
---> 44     error2 = error1.T - A1_sig
     45 
     46     return(error1)

ValueError: operands could not be broadcast together with shapes (2,3) (2,2) 

Have I set the matrix dimensions incorrectly?

blue-sky
  • I just noticed your outputs have three columns, where more than one column can be set. Are you trying to do [Multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification)? – Imran Dec 16 '17 at 12:20
  • @Imran yes, it's multi label despite having three columns. – blue-sky Dec 16 '17 at 12:31

2 Answers


Your first weight matrix, w1, should be of shape (n_features, layer_1_size), so when you multiply an input X of shape (m_examples, n_features) by w1, you get an (m_examples, layer_1_size) matrix. This gets run through the activation of layer 1 and is then fed into layer 2, which should have a weight matrix of shape (layer_1_size, output_size), where output_size=3 since you are doing multi-label classification for 3 classes. As you can see, the point is to convert each layer's input into a shape that fits the number of neurons in that layer; in other words, each input to a layer must feed into every neuron in that layer.

I wouldn't take the transpose of your layer inputs as you have it; instead, shape the weight matrices as described so you can compute np.dot(X, w1), etc.

It also looks like you are not handling your biases correctly. When we compute Z = np.dot(X, w1) + b1, b1 should be broadcast so that it is added to every row of the product of X and w1. This will not happen if you concatenate b1 onto your weight matrix as you have it. Rather, you should add a column of ones to your input matrix and an additional row to your weight matrix, so the bias terms sit in that row of your weight matrix and the ones in your input ensure they get added everywhere. In this setup you don't need separate b1, b2 terms.

X_train = np.c_[X_train, np.ones(m_examples)]

and remember to add one more row to your weights, so w1 should have shape (n_features+1, layer_1_size).
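
To make the shape bookkeeping concrete, here is a minimal sketch of the setup described above, extending the same ones-column trick to the second layer (the hidden size of 2 is taken from the question, not prescribed by this answer):

import numpy as np

m_examples, n_features = 2, 4                    # matches X_train in the question
layer_1_size, output_size = 2, 3                 # hidden size assumed; 3 output labels

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.asarray([[1, 1, 1, 1], [0, 0, 0, 0]])     # (m_examples, n_features)
X = np.c_[X, np.ones(m_examples)]                # append bias column -> (2, 5)

# one extra row per weight matrix holds the bias terms
w1 = np.random.uniform(-1, 1, (n_features + 1, layer_1_size))    # (5, 2)
w2 = np.random.uniform(-1, 1, (layer_1_size + 1, output_size))   # (3, 3)

A1 = sigmoid(np.dot(X, w1))                      # (m_examples, layer_1_size)
A1 = np.c_[A1, np.ones(m_examples)]              # bias column for layer 2
A2 = sigmoid(np.dot(A1, w2))                     # (m_examples, output_size)
print(A2.shape)                                  # (2, 3)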

Update for backpropagation:

The goal of backpropagation is to compute the gradient of your error function with respect to your weights and biases and use each result to update each weights matrix and each bias vector.

So you need dE/dw2, dE/db2, dE/dw1, and dE/db1 so you can apply the updates:

w2 <- w2 - learning_rate * dE/dw2
b2 <- b2 - learning_rate * dE/db2
w1 <- w1 - learning_rate * dE/dw1
b1 <- b1 - learning_rate * dE/db1

Since you are doing multilabel classification, you should be using binary crossentropy loss:

E = -(1/m) * Σ_i [ y_i * log(a_i) + (1 - y_i) * log(1 - a_i) ], applied to each of the 3 output labels, where a_i is the sigmoid output for example i and y_i the target.
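
A rough numpy version of that loss (the clipping is only a guard against log(0), not part of the formula):

import numpy as np

def binary_crossentropy(Y, A2, eps=1e-12):
    # Y and A2 must have the same shape; mean over all examples and labels
    A2 = np.clip(A2, eps, 1 - eps)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))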

You can compute dE/dw2 using the chain rule:

dE/dw2 = (dE/dA2) * (dA2/dZ2) * (dZ2/dw2)

I am using Z2 for your A2_dot since the activation hasn't been applied yet, and I'm using A2 for your A2_sig.
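
As a quick sanity check of that chain rule for a single sigmoid output with binary crossentropy loss: dE/dA2 = (A2 - y) / (A2 * (1 - A2)), dA2/dZ2 = A2 * (1 - A2), and dZ2/dw2 = A1, so the product collapses to (A2 - y) * A1. The scalar values below are arbitrary:

import numpy as np

A1, w2, y = 0.7, 0.3, 1.0                    # arbitrary values for the check
Z2 = A1 * w2
A2 = 1 / (1 + np.exp(-Z2))

dE_dA2 = (A2 - y) / (A2 * (1 - A2))          # derivative of binary crossentropy w.r.t. A2
dA2_dZ2 = A2 * (1 - A2)                      # derivative of the sigmoid
dZ2_dw2 = A1                                 # derivative of the linear layer w.r.t. w2

print(dE_dA2 * dA2_dZ2 * dZ2_dw2)            # chain-rule product
print((A2 - y) * A1)                         # same value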

See Notes on Backpropagation [pdf] for a detailed derivation for crossentropy loss with sigmoid activation. This gives a pointwise derivation, however, whereas we are looking for a vectorized implementation, so you will have to do a bit of work to figure out the correct layout for your matrices. There is also no explicit bias vector, unfortunately.

The expression you have for error1 looks correct, but I would call it dw2, and I would just use Y_train instead of taking the transpose twice:

dw2 = (1/m) * np.dot((A2 - Y_train).T , A1)

And you also need db2 which should be:

db2 = (1/m) * np.sum(A2 - Y_train, axis=1, keepdims=True)

You will have to apply the chain rule further to get dw1 and db1, and I'll leave that to you, but there is a nice derivation in Week 3 of the Neural Networks and Deep Learning Coursera Course.

I can't say much about the line you are getting an error on besides that I don't think you should have that calculation in your backprop code, so it makes sense that the dimensions don't match. You might be thinking of the gradient at the output, but I can't think of any similar expression involving A1 for backprop in this network.

This article has a very nice implementation of a one hidden layer neural net in numpy. It does use softmax at the output, but it has sigmoid activations in the hidden layer and otherwise the difference in calculation is minimal. It should help you calculate dw1 and db1 for the hidden layer. Specifically, look at the expression for delta1 in the section titled "A neural network in practice".

Converting their calculation to the notation we're using, and using a sigmoid at the output instead of softmax, it should look like:

# shapes follow the column-wise convention of the Coursera course cited above:
# X_train is (n_features, m), A1 is (layer_1_size, m),
# A2 and Y_train are (output_size, m), w2 is (output_size, layer_1_size)
dZ2 = A2 - Y_train
dZ1 = np.dot(w2.T, dZ2) * A1 * (1 - A1)  # element-wise product with sigmoid'(Z1)

dw2 = (1/m) * np.dot(dZ2, A1.T)
db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)

dw1 = (1/m) * np.dot(dZ1, X_train.T)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
Imran

Code review

I have examined your latest version and noticed the following mistakes:

  • (minor) In the forward pass, A1_sig is never used (A2_dot is computed from A1_dot instead); maybe it's just a typo.
  • (major) In the backward pass, I'm not sure what you intended to use as a loss function. From the code it looks like an L2 loss:

    error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
    

    The key expression is this: A2_sig - Y_trainT.T (though maybe I just don't get your idea).

    However, you mention that you're doing multi-label classification, most probably binary. In this case, L2 loss is a poor choice (see this post if you're interested in why). Instead, use logistic regression loss, a.k.a. cross-entropy. In your case, it's the binary variant.

  • (critical) In the backward pass, you've skipped the sigmoid layer. The following line takes the loss error and passes it through the linear layer:

    error1 = np.dot((A2_sig - Y_trainT.T).T , A1_sig / M)
    

    ... while the forward pass goes through the sigmoid activation after the linear layer (which is correct). At this point, error1 doesn't make sense and its dimensions don't matter (see the sketch right after this list).
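
In other words, the error first has to go back through the sigmoid before it reaches the linear layer. A minimal sketch of that missing step, using the same shapes as the solution below:

dA2 = (A2 - Y_train) / (A2 * (1 - A2))   # gradient of the binary cross-entropy loss w.r.t. A2
dZ2 = dA2 * A2 * (1 - A2)                # back through the sigmoid; simplifies to A2 - Y_train
dw2 = np.dot(dZ2, A1.T)                  # only then back through the linear layer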

Solution

I don't like your variable naming; it's very easy to get confused. So I changed it and reorganized the code a bit. Here's the converging NN:

import numpy as np

def sigmoid(z):
  return 1 / (1 + np.exp(-z))

X_train = np.asarray([[1, 1, 1, 1], [0, 0, 0, 0]]).T
Y_train = np.asarray([[1, 1, 1], [0, 0, 0]]).T

hidden_size = 2
output_size = 3
learning_rate = 0.1

w1 = np.random.randn(hidden_size, 4) * 0.1
b1 = np.zeros((hidden_size, 1))
w2 = np.random.randn(output_size, hidden_size) * 0.1
b2 = np.zeros((output_size, 1))

for i in range(50):
  # forward pass

  Z1 = np.dot(w1, X_train) + b1
  A1 = sigmoid(Z1)

  Z2 = np.dot(w2, A1) + b2
  A2 = sigmoid(Z2)

  cost = -np.mean(Y_train * np.log(A2) + (1 - Y_train) * np.log(1 - A2))
  print(cost)

  # backward pass

  dA2 = (A2 - Y_train) / (A2 * (1 - A2))

  dZ2 = np.multiply(dA2, A2 * (1 - A2))
  dw2 = np.dot(dZ2, A1.T)
  db2 = np.sum(dZ2, axis=1, keepdims=True)

  dA1 = np.dot(w2.T, dZ2)
  dZ1 = np.multiply(dA1, A1 * (1 - A1))
  dw1 = np.dot(dZ1, X_train.T)
  db1 = np.sum(dZ1, axis=1, keepdims=True)

  w1 = w1 - learning_rate * dw1
  w2 = w2 - learning_rate * dw2
  b1 = b1 - learning_rate * db1
  b2 = b2 - learning_rate * db2
Maxim
  • Thanks for this. Should the forward step be part of the iteration loop? My understanding is that backpropagation is a separate step. – blue-sky Dec 22 '17 at 23:06
  • 1
  • NN training is forward *and* backward passes over and over again. You can try doing just one of them, just for the experiment - the optimization won't work. – Maxim Dec 22 '17 at 23:11
  • To classify a single example, should I modify the forward pass to take the average of w1 and b1? Currently the forward pass classifies the entire training set. – blue-sky Dec 22 '17 at 23:25
  • Do you mean training a single example? You can split `X_train` into batches and do the same with each batch – Maxim Dec 22 '17 at 23:27
  • I mean classify a new example once training is complete. For example [1,0,0,0] – blue-sky Dec 22 '17 at 23:30
  • This appears to work as expected: `toclassify = [1, 0, 0, 0]; new = np.asarray([toclassify]); Z1 = np.dot(w1, new.T) + db1; A1 = sigmoid(Z1); Z2 = np.dot(w2, A1) + db2; A2 = sigmoid(Z2)` – blue-sky Dec 23 '17 at 00:07
  • @blue-sky Correct, to get an inference result, just do the forward pass for the new data (see the sketch after these comments). The result array will contain the probabilities of each label. – Maxim Dec 23 '17 at 07:46
  • Should I round the result array to the nearest whole number to get the correct result? – blue-sky Dec 23 '17 at 09:14
  • Do you mean rounding a number from `(0,1)`? In your case, yes, it will give the target class. There can be many ways to interpret the probabilities, e.g. using a threshold for when the network is "unsure". – Maxim Dec 23 '17 at 09:20
  • I'm wondering whether rounding each of the result array values produces the correct label when all of the rounded values are concatenated, or whether the result array is an array of probabilities for each label, where each element corresponds to the probability of the predicted value. – blue-sky Dec 23 '17 at 09:28
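
For completeness, a minimal inference sketch of what the comments above describe. It reuses sigmoid and the trained w1, b1, w2, b2 from the loop in the answer (the bias vectors b1/b2, not the gradients db1/db2 quoted in the earlier comment), and applies a 0.5 threshold as one simple way to turn the per-label probabilities into 0/1 labels:

new_example = np.asarray([[1, 0, 0, 0]]).T    # shape (4, 1): one column per example
Z1 = np.dot(w1, new_example) + b1
A1 = sigmoid(Z1)
Z2 = np.dot(w2, A1) + b2
A2 = sigmoid(Z2)                              # shape (3, 1): probability of each label
labels = (A2 > 0.5).astype(int)               # 0/1 prediction per label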