
I'm getting different results when calculating the negative log likelihood of a simple two-layer neural net in Theano and numpy.

This is the numpy code:

W1,b1,W2,b2 = model['W1'], model['b1'], model['W2'], model['b2']
N, D = X.shape

where model is a dictionary holding the initial parameters (generated by an init function) and X is the input array.

z_1 = X
z_2 = np.dot(X,W1) + b1
z_3 = np.maximum(0, z_2)
z_4 = np.dot(z_3,W2)+b2
scores = z_4
exp_scores = np.exp(scores)
exp_sum = np.sum(exp_scores, axis = 1)
exp_sum.shape = (exp_scores.shape[0],1)
y_hat = exp_scores / exp_sum
loss = np.sum(np.log(y_hat[np.arange(y.shape[0]),y]))
loss = -1/float(y_hat.shape[0])*loss + reg/2.0*np.sum(np.multiply(W1,W1))+ reg/2.0*np.sum(np.multiply(W2,W2))

I'm getting a result of 1.3819194609246772, which is the correct value for the loss function. However, my Theano code yields a value of 1.3715655944645178.

t_z_1 = T.dmatrix('z_1')
t_W1 = theano.shared(value = W1, name = 'W1', borrow = True)
t_b1 = theano.shared(value = b1, name = 'b1',borrow = True)
t_W2 = theano.shared(value = W2, name = 'W2')
t_b2 = theano.shared(value = b2, name = 'b2')
t_y = T.lvector('y')
t_reg = T.dscalar('reg')

first_layer = T.dot(t_z_1,W1) + t_b1
t_hidden = T.switch(first_layer > 0, 0, first_layer)
t_out = T.nnet.softmax(T.dot(t_hidden, W2)+t_b2)
t_cost = -T.mean(T.log(t_out)[T.arange(t_y.shape[0]),t_y],dtype = theano.config.floatX, acc_dtype = theano.config.floatX)+t_reg/2.0*T.sum(T.sqr(t_W1))+t_reg/2.0*T.sum(T.sqr(t_W2))
cost_func = theano.function([t_z_1,t_y,t_reg],t_cost)
loss =  cost_func(z_1,y,reg)

The values in the output layer are already wrong, and I'm not sure what the problem could be. Does theano.shared() keep the dtype of the numpy array that is passed as the value argument, or is it converted to float32? Can anybody tell me what I'm doing wrong in the Theano code?
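
As a side note on the dtype question, this is the kind of check I have in mind (a minimal sketch; the random array below is just a stand-in for W1):

import numpy as np
import theano

# stand-in for one of the weight matrices above (numpy defaults to float64)
w = np.random.randn(4, 10)
w_shared = theano.shared(value=w, name='W', borrow=True)

print(w.dtype)                     # float64
print(w_shared.get_value().dtype)  # dtype actually stored in the shared variable
print(w_shared.dtype)              # dtype recorded in the Theano type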

EDIT: The problem seems to occur in the hidden layer, after applying the ReLU function. Below is the comparison between the Theano and numpy results in each layer; the sketch right after this paragraph shows how the intermediate Theano values can be pulled out.
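
This is roughly how the Theano intermediates were obtained, following the suggestion in the comments (a sketch reusing the graph variables defined above, not necessarily the exact code I ran):

debug_func = theano.function([t_z_1], [first_layer, t_hidden, t_out])
th_first, th_hidden, th_out = debug_func(z_1)
# compare th_first with z_2, th_hidden with z_3, th_out with the numpy output
print(th_first)
print(th_hidden)
print(th_out)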

theano results of first layer
[[-0.3245614  -0.22532614 -0.12609087 -0.0268556   0.07237967  0.17161493
   0.2708502   0.37008547  0.46932074  0.56855601]
 [-0.26107962 -0.14975259 -0.03842555  0.07290148  0.18422852  0.29555556
   0.40688259  0.51820963  0.62953666  0.7408637 ]
 [-0.19759784 -0.07417904  0.04923977  0.17265857  0.29607737  0.41949618
   0.54291498  0.66633378  0.78975259  0.91317139]
 [-0.13411606  0.00139451  0.13690508  0.27241565  0.40792623  0.5434368
   0.67894737  0.81445794  0.94996851  1.08547908]
 [-0.07063428  0.07696806  0.2245704   0.37217274  0.51977508  0.66737742
   0.81497976  0.9625821   1.11018444  1.25778677]]
numpy results of first layer
[[-0.3245614  -0.22532614 -0.12609087 -0.0268556   0.07237967  0.17161493
   0.2708502   0.37008547  0.46932074  0.56855601]
 [-0.26107962 -0.14975259 -0.03842555  0.07290148  0.18422852  0.29555556
   0.40688259  0.51820963  0.62953666  0.7408637 ]
 [-0.19759784 -0.07417904  0.04923977  0.17265857  0.29607737  0.41949618
   0.54291498  0.66633378  0.78975259  0.91317139]
 [-0.13411606  0.00139451  0.13690508  0.27241565  0.40792623  0.5434368
   0.67894737  0.81445794  0.94996851  1.08547908]
 [-0.07063428  0.07696806  0.2245704   0.37217274  0.51977508  0.66737742
   0.81497976  0.9625821   1.11018444  1.25778677]]
theano results of hidden layer
[[-0.3245614  -0.22532614 -0.12609087 -0.0268556   0.          0.          0.
   0.          0.          0.        ]
 [-0.26107962 -0.14975259 -0.03842555  0.          0.          0.          0.
   0.          0.          0.        ]
 [-0.19759784 -0.07417904  0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [-0.13411606  0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [-0.07063428  0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]]
numpy results of hidden layer
[[ 0.          0.          0.          0.          0.07237967  0.17161493
   0.2708502   0.37008547  0.46932074  0.56855601]
 [ 0.          0.          0.          0.07290148  0.18422852  0.29555556
   0.40688259  0.51820963  0.62953666  0.7408637 ]
 [ 0.          0.          0.04923977  0.17265857  0.29607737  0.41949618
   0.54291498  0.66633378  0.78975259  0.91317139]
 [ 0.          0.00139451  0.13690508  0.27241565  0.40792623  0.5434368
   0.67894737  0.81445794  0.94996851  1.08547908]
 [ 0.          0.07696806  0.2245704   0.37217274  0.51977508  0.66737742
   0.81497976  0.9625821   1.11018444  1.25778677]]
theano results of output
[[ 0.14393463  0.2863576   0.56970777]
 [ 0.14303947  0.28582359  0.57113693]
 [ 0.1424154   0.28544871  0.57213589]
 [ 0.14193274  0.28515729  0.57290997]
 [ 0.14171057  0.28502272  0.57326671]]
numpy results of output
[[-0.5328368   0.20031504  0.93346689]
 [-0.59412164  0.15498488  0.9040914 ]
 [-0.67658362  0.08978957  0.85616275]
 [-0.77092643  0.01339997  0.79772637]
 [-0.89110401 -0.08754544  0.71601312]]

I got the idea of using the switch() function for the ReLU layer from this post: Theano HiddenLayer Activation Function, and I don't really see how that call is different from the equivalent numpy code z_3 = np.maximum(0, z_2).

Solution to the first problem: T.switch(first_layer > 0, 0, first_layer) sets all the values greater than 0 to 0, so it should be T.switch(first_layer < 0, 0, first_layer).
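
For reference, the corrected hidden layer line; as far as I can tell, T.maximum works as well and mirrors the numpy version more directly:

t_hidden = T.switch(first_layer < 0, 0, first_layer)
# or, presumably equivalent and closer to the numpy version:
t_hidden = T.maximum(0, first_layer)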

EDIT2: The gradients that Theano calculates differ significantly from the numerical gradients I was given; this is my implementation:

g_w1, g_b1, g_w2, g_b2 = T.grad(t_cost,[t_W1,t_b1,t_W2,t_b2])

grads = {}
grads['W1'] = g_w1.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['b1'] = g_b1.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['W2'] = g_w2.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['b2'] = g_b2.eval({t_z_1 : z_1, t_y : y,t_reg : reg})

This is an assignment for the Convolutional Neural Networks class that Stanford offered earlier this year, and I think it's safe to say that their numerical gradients are probably correct. I could post the code of their numerical implementation if required.
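
For context, a generic numerical gradient check of the kind I'm comparing against would look roughly like this (my own centered-difference sketch, not the course's exact implementation; f maps the parameter array to the scalar loss):

def numerical_gradient(f, x, h=1e-5):
    # centered differences: df/dx[i] ~ (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old_value = x[idx]
        x[idx] = old_value + h
        f_plus = f(x)
        x[idx] = old_value - h
        f_minus = f(x)
        x[idx] = old_value        # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
        it.iternext()
    return grad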

I compute the relative error the following way:

def relative_error(num, ana):
    numerator = np.sum(np.abs(num-ana))
    denom = np.sum(np.abs(num))+np.sum(np.abs(ana))
    return numerator/denom

Calculating the numerical gradients with the eval_numerical_gradient method that was provided by the course, I get the following relative errors for the gradients:

param_grad_num = {}
rel_error = {}
for param_name in grads:
    param_grad_num[param_name] = eval_numerical_gradient(lambda W: two_layer_net(X, model, y, reg)[0], model[param_name], verbose=False)
    rel_error[param_name] = relative_error(param_grad_num[param_name],grads[param_name])

{'W1': 0.010069468997284833,
 'W2': 0.6628490408291472,
 'b1': 1.9498867941113963e-09,
 'b2': 1.7223972753120753e-11}

The errors for W1 and W2 are far too large; the relative error should be less than 1e-8. Can anybody explain this or help in any way?

    If you include `first_layer`, `t_hidden`, `t_out` as additional outputs of `cost_fun` then you could compare those intermediate Theano values with their numpy equivalents and perhaps narrow down the source of the difference. – Daniel Renshaw Aug 25 '15 at 13:07
  • @DanielRenshaw Thanks for your comment. I have edited my question. Looks as if the problem is caused by the ReLU function: T.switch(first_layer > 0, 0, first_layer). Is that not equivalent to np.maximum(0, z_2)? – eager2learn Aug 25 '15 at 13:39
  • Nevermind, I just noticed that the switch function returns 0 if the values are positive, so I need to use T.switch(first_layer < 0, 0, first_layer). Thanks for your help. – eager2learn Aug 25 '15 at 13:46
  • I have encountered another problem: this time the gradients calculated by Theano differ from the numerical gradients, see EDIT2. Since the nature of the problem is similar to the first one, I think it might be ok to use this question for it. If not, I'll open a new question. – eager2learn Aug 25 '15 at 14:23
