
I have 1 input layer, 2 hidden layers, and 1 output layer, and for a single training example x with output y I have computed the following:

x = [1;0;1]; 
y = [1;1;1]; 

    theta1 =

        4.7300    3.2800    1.4600
             0         0         0
        4.7300    3.2800    1.4600

    theta2 =

        8.8920    8.8920    8.8920
        6.1670    6.1670    6.1670
        2.7450    2.7450    2.7450

    theta3 =

        9.4460    6.5500    2.9160
        9.3510    6.4850    2.8860
        8.8360    6.1270    2.7270

theta1 controls the mapping between the input layer and layer 1, theta2 controls the mapping between layer 1 and layer 2, and theta3 controls the mapping between layer 2 and the output layer.

But to compute gradient descent using theta(i) = theta(i) - (alpha/m .* (x .* theta(i) - y)' * x)', where i = 1, 2, or 3, the dimensions of x and y are incorrect. The dimensions are correct (by "correct" I mean the theta update executes without an error) if x and y are 1x9 instead of 1x3. Do I need to change the architecture of my neural network, or can I just set
x = [1;0;1;0;0;0;0;0;0]; y = [1;1;1;0;0;0;0;0;0];
so that the matrix multiplication works out?
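As a quick sanity check on the shapes involved, here is a NumPy sketch (an illustration using the question's values, not code from the post) showing that the matrix product theta1*x, rather than the elementwise x .* theta1, is the operation whose dimensions line up with the original 3x1 vectors:

```python
import numpy as np

# Values from the question: x and y are 3x1 column vectors, theta1 is 3x3.
x = np.array([[1.0], [0.0], [1.0]])
y = np.array([[1.0], [1.0], [1.0]])
theta1 = np.array([[4.73, 3.28, 1.46],
                   [0.00, 0.00, 0.00],
                   [4.73, 3.28, 1.46]])

# An elementwise product x .* theta1 is not what a layer computes;
# the matrix product theta1 * x is, and its dimensions work out:
pred = theta1 @ x          # (3,3) @ (3,1) -> (3,1)
error = pred - y           # (3,1), same shape as y
print(error.shape)         # (3, 1)
```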

Update:

alpha = learning rate (.00001)
m = number of training examples (1)
theta(i) refers to theta1, theta2, and theta3

I'm using vectorised gradient descent as described in Vectorization of a gradient descent code.

Update 2:

This MATLAB code appears to work:

m = 1; 
alpha = .0000001; 
x = [1;0;1]; 
y = [1; 1; 1]; 
theta1 = [4.7300 3.2800 1.4600; 0 0 0; 4.7300 3.2800 1.4600]; 
theta1 = theta1 - (alpha/m) * (x' * (theta1 * x - y));

Is theta1 = theta1 - (alpha/m) * (x' * (theta1 * x - y)); a correct implementation of vectorised gradient descent?

I understand this will cause issues when unrolling the theta matrices into theta vectors, since the dimensions will not be the same, but for working with theta matrices instead of theta vectors, is this correct?

Update: The formula is modified from Vectorization of a gradient descent code, where gradient descent is given as theta = theta - (alpha/m) * (X' * (X*theta - y));. I changed it to theta = theta - (alpha/m) * (x' * (theta * x - y));, i.e. (X*theta - y) changed to (theta * x - y).
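A dimension check (a NumPy sketch of the MATLAB expressions above, not from the original post) suggests why this version runs without error yet may not do what is wanted: with column vectors, x' * (theta1*x - y) collapses to a 1x1 scalar, which MATLAB then subtracts from every entry of theta1. The shape-preserving outer product for the column-vector convention would be (theta1*x - y) * x':

```python
import numpy as np

x = np.array([[1.0], [0.0], [1.0]])   # column vectors, as in Update 2
y = np.array([[1.0], [1.0], [1.0]])
theta1 = np.array([[4.73, 3.28, 1.46],
                   [0.00, 0.00, 0.00],
                   [4.73, 3.28, 1.46]])

# Update 2's expression: x' * (theta1*x - y) is (1,3)@(3,1) -> a 1x1 scalar,
# so the subtraction broadcasts the SAME number over all entries of theta1.
scalar_update = x.T @ (theta1 @ x - y)
print(scalar_update.shape)            # (1, 1)

# The outer product (theta1*x - y) * x' instead has the shape of theta1:
outer_update = (theta1 @ x - y) @ x.T
print(outer_update.shape)             # (3, 3)
```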

  • The statement `x.*theta(i)` seems fishy since the dimensions don't match, no matter if `x` is 1x3 or 1x9. Could you clarify a few things in your post: Is `theta(i)` something actually written in your code, or are you just writing it here to refer to `theta1`, `theta2`, or `theta3` simultaneously? Are `alpha` and `m` constants? – Geoff May 19 '16 at 20:55
  • @Geoff please see question update. 'The statement x.*theta(i) seems fishy since the dimensions don't match.' I agree, I'm not sure if it's an issue with my network architecture or something else. – blue-sky May 19 '16 at 21:02

1 Answer


In your referenced post, X is a matrix containing m rows (the number of training samples). In your case m = 1, so X becomes a row vector, while in your initialization x is a column vector. Thus the simplest change is to set x = x' and y = y', so that both your input and output become row vectors.

The formula would still be

theta3 = theta3 - (alpha/m) * (x' * (x*theta3 - y));

which (using alpha = .00001 and m = 1 from the question) evaluates to

  9.4458   6.5499   2.9160
  9.3510   6.4850   2.8860
  8.8358   6.1269   2.7270

The update rule for theta2 and theta1 would be similar.

Also, the error term x*theta3 - y always has the same shape as the input x, and the raw update amount x' * (x*theta3 - y) always has the same shape as theta3.
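These shape claims can be checked numerically. A NumPy sketch of the row-vector update (an illustration, assuming alpha = .00001 and m = 1 from the question's first update):

```python
import numpy as np

alpha, m = 1e-5, 1
x = np.array([[1.0, 0.0, 1.0]])       # transposed to a 1x3 row vector
y = np.array([[1.0, 1.0, 1.0]])
theta3 = np.array([[9.4460, 6.5500, 2.9160],
                   [9.3510, 6.4850, 2.8860],
                   [8.8360, 6.1270, 2.7270]])

error = x @ theta3 - y                # (1,3): same shape as the input x
update = x.T @ error                  # (3,1)@(1,3) -> (3,3): same shape as theta3
theta3 = theta3 - (alpha / m) * update
print(np.round(theta3, 4))            # matches the 3x3 result shown above
```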

  • you updated the function from theta1 = theta1 - (alpha/m) * (x' * (theta1 * x - y)); to theta3 - alpha/m .* x * (theta3 * x - y), so my gradient descent function is incorrect? – blue-sky May 20 '16 at 22:55
  • I might be wrong, do you have a reference to your formula? – greeness May 20 '16 at 23:05
  • Yeah, given the reference, I now see what the problem is. – greeness May 20 '16 at 23:21
  • thanks, could you expand on the intuition behind your new gradient descent function ? – blue-sky May 21 '16 at 08:21
  • it's the same as your reference. If you want to know why it looks like this, you need to provide the activation function at each layer. – greeness May 24 '16 at 08:40