I'm getting different results when calculating the negative log likelihood of a simple two-layer neural net in Theano and in NumPy.
This is the numpy code:
W1,b1,W2,b2 = model['W1'], model['b1'], model['W2'], model['b2']
N, D = X.shape
where model is the dictionary of initial parameters (produced by an init function) and X is the input array.
z_1 = X
z_2 = np.dot(X, W1) + b1                # first affine layer
z_3 = np.maximum(0, z_2)                # ReLU
z_4 = np.dot(z_3, W2) + b2              # second affine layer
scores = z_4
exp_scores = np.exp(scores)
exp_sum = np.sum(exp_scores, axis=1)
exp_sum.shape = (exp_scores.shape[0], 1)
y_hat = exp_scores / exp_sum            # softmax probabilities
loss = np.sum(np.log(y_hat[np.arange(y.shape[0]), y]))
loss = -1/float(y_hat.shape[0])*loss + reg/2.0*np.sum(np.multiply(W1, W1)) + reg/2.0*np.sum(np.multiply(W2, W2))
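In other words, what I'm computing is the mean negative log probability of the correct class plus L2 regularization, i.e. loss = -1/N * sum_i log(softmax(scores)[i, y_i]) + reg/2 * (sum(W1**2) + sum(W2**2)).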
I'm getting a result of 1.3819194609246772, which is the correct value for the loss function. However, my Theano code yields a value of 1.3715655944645178.
t_z_1 = T.dmatrix('z_1')
t_W1 = theano.shared(value = W1, name = 'W1', borrow = True)
t_b1 = theano.shared(value = b1, name = 'b1',borrow = True)
t_W2 = theano.shared(value = W2, name = 'W2')
t_b2 = theano.shared(value = b2, name = 'b2')
t_y = T.lvector('y')
t_reg = T.dscalar('reg')
first_layer = T.dot(t_z_1,W1) + t_b1
t_hidden = T.switch(first_layer > 0, 0, first_layer)
t_out = T.nnet.softmax(T.dot(t_hidden, W2)+t_b2)
t_cost = -T.mean(T.log(t_out)[T.arange(t_y.shape[0]),t_y],dtype = theano.config.floatX, acc_dtype = theano.config.floatX)+t_reg/2.0*T.sum(T.sqr(t_W1))+t_reg/2.0*T.sum(T.sqr(t_W2))
cost_func = theano.function([t_z_1,t_y,t_reg],t_cost)
loss = cost_func(z_1,y,reg)
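To compare the Theano values with the numpy ones layer by layer, I can compile small helper functions for the intermediate expressions, something like this (a sketch reusing the symbolic variables defined above):

# Helper functions that evaluate the intermediate expressions of the
# same graph, so each stage can be printed next to the numpy version.
first_layer_func = theano.function([t_z_1], first_layer)
hidden_func = theano.function([t_z_1], t_hidden)
out_func = theano.function([t_z_1], t_out)

print(first_layer_func(z_1))   # compare with z_2
print(hidden_func(z_1))        # compare with z_3
print(out_func(z_1))           # compare with y_hat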
I'm already getting wrong results when calculating the values in the output layer, and I'm not really sure what the problem could be. Does the shared() function keep the dtype of the numpy array that is passed as the value argument, or is it converted to float32? Can anybody tell me what I'm doing wrong in the Theano code?
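One quick check I can run for the dtype question, assuming the arrays and compiled function from above:

# Compare the dtype of the original numpy arrays with the dtype the
# shared variables report, and the dtype of the compiled function's output.
print(W1.dtype, t_W1.dtype, t_W1.get_value().dtype)
print(b1.dtype, t_b1.dtype, t_b1.get_value().dtype)
print(cost_func(z_1, y, reg).dtype)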
EDIT: The problem seems to occur in the hidden layer, after applying the ReLU function. Here is a comparison of the Theano and numpy results at each layer:
theano results of first layer
[[-0.3245614 -0.22532614 -0.12609087 -0.0268556 0.07237967 0.17161493
0.2708502 0.37008547 0.46932074 0.56855601]
[-0.26107962 -0.14975259 -0.03842555 0.07290148 0.18422852 0.29555556
0.40688259 0.51820963 0.62953666 0.7408637 ]
[-0.19759784 -0.07417904 0.04923977 0.17265857 0.29607737 0.41949618
0.54291498 0.66633378 0.78975259 0.91317139]
[-0.13411606 0.00139451 0.13690508 0.27241565 0.40792623 0.5434368
0.67894737 0.81445794 0.94996851 1.08547908]
[-0.07063428 0.07696806 0.2245704 0.37217274 0.51977508 0.66737742
0.81497976 0.9625821 1.11018444 1.25778677]]
numpy results of first layer
[[-0.3245614 -0.22532614 -0.12609087 -0.0268556 0.07237967 0.17161493
0.2708502 0.37008547 0.46932074 0.56855601]
[-0.26107962 -0.14975259 -0.03842555 0.07290148 0.18422852 0.29555556
0.40688259 0.51820963 0.62953666 0.7408637 ]
[-0.19759784 -0.07417904 0.04923977 0.17265857 0.29607737 0.41949618
0.54291498 0.66633378 0.78975259 0.91317139]
[-0.13411606 0.00139451 0.13690508 0.27241565 0.40792623 0.5434368
0.67894737 0.81445794 0.94996851 1.08547908]
[-0.07063428 0.07696806 0.2245704 0.37217274 0.51977508 0.66737742
0.81497976 0.9625821 1.11018444 1.25778677]]
theano results of hidden layer
[[-0.3245614 -0.22532614 -0.12609087 -0.0268556 0. 0. 0.
0. 0. 0. ]
[-0.26107962 -0.14975259 -0.03842555 0. 0. 0. 0.
0. 0. 0. ]
[-0.19759784 -0.07417904 0. 0. 0. 0. 0.
0. 0. 0. ]
[-0.13411606 0. 0. 0. 0. 0. 0.
0. 0. 0. ]
[-0.07063428 0. 0. 0. 0. 0. 0.
0. 0. 0. ]]
numpy results of hidden layer
[[ 0. 0. 0. 0. 0.07237967 0.17161493
0.2708502 0.37008547 0.46932074 0.56855601]
[ 0. 0. 0. 0.07290148 0.18422852 0.29555556
0.40688259 0.51820963 0.62953666 0.7408637 ]
[ 0. 0. 0.04923977 0.17265857 0.29607737 0.41949618
0.54291498 0.66633378 0.78975259 0.91317139]
[ 0. 0.00139451 0.13690508 0.27241565 0.40792623 0.5434368
0.67894737 0.81445794 0.94996851 1.08547908]
[ 0. 0.07696806 0.2245704 0.37217274 0.51977508 0.66737742
0.81497976 0.9625821 1.11018444 1.25778677]]
theano results of output
[[ 0.14393463 0.2863576 0.56970777]
[ 0.14303947 0.28582359 0.57113693]
[ 0.1424154 0.28544871 0.57213589]
[ 0.14193274 0.28515729 0.57290997]
[ 0.14171057 0.28502272 0.57326671]]
numpy results of output
[[-0.5328368 0.20031504 0.93346689]
[-0.59412164 0.15498488 0.9040914 ]
[-0.67658362 0.08978957 0.85616275]
[-0.77092643 0.01339997 0.79772637]
[-0.89110401 -0.08754544 0.71601312]]
I got the idea of using the switch() function for the ReLU layer from this post: Theano HiddenLayer Activation Function, and I don't really see how that function is different from the equivalent numpy code z_3 = np.maximum(0, z_2). Solution to the first problem: T.switch(first_layer > 0, 0, first_layer) sets all values greater than 0 to 0, so it should be T.switch(first_layer < 0, 0, first_layer).
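For reference, here is a small check showing that T.switch(x < 0, 0, x) (or simply T.maximum(0, x)) matches np.maximum(0, x); just a sketch with made-up input:

import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix('x')
relu_switch = theano.function([x], T.switch(x < 0, 0, x))  # zero out the negatives
relu_max = theano.function([x], T.maximum(0, x))           # equivalent formulation

a = np.array([[-1.0, 2.0], [3.0, -4.0]])
print(np.allclose(relu_switch(a), np.maximum(0, a)))  # True
print(np.allclose(relu_max(a), np.maximum(0, a)))     # True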
EDIT 2: The gradients that Theano calculates differ significantly from the numerical gradients I was given. This is my implementation:
g_w1, g_b1, g_w2, g_b2 = T.grad(t_cost,[t_W1,t_b1,t_W2,t_b2])
grads = {}
grads['W1'] = g_w1.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['b1'] = g_b1.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['W2'] = g_w2.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
grads['b2'] = g_b2.eval({t_z_1 : z_1, t_y : y,t_reg : reg})
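Equivalently, the four gradients can come from one compiled function instead of four eval() calls; a sketch using the symbolic variables defined above:

# One function that returns all four gradients in a single call.
grad_func = theano.function(
    [t_z_1, t_y, t_reg],
    T.grad(t_cost, [t_W1, t_b1, t_W2, t_b2]),
)
grads = dict(zip(['W1', 'b1', 'W2', 'b2'], grad_func(z_1, y, reg)))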
This is an assignment from the Convolutional Neural Networks class that Stanford offered earlier this year, and I think it's safe to say that their numerical gradients are probably correct. I could post the code for their numerical implementation if required.
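My understanding is that it does the usual centered-difference check, roughly along these lines (this is my own sketch, not the course's code):

# Centered-difference numerical gradient: perturb one entry of x at a time.
# f takes the full parameter array and returns the scalar loss.
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        fxph = f(x)            # f(x + h)
        x[idx] = old - h
        fxmh = f(x)            # f(x - h)
        x[idx] = old           # restore original value
        grad[idx] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad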
Using a relative error computed the following way:
def relative_error(num, ana):
    numerator = np.sum(np.abs(num - ana))
    denom = np.sum(np.abs(num)) + np.sum(np.abs(ana))
    return numerator / denom
Calculating the numerical gradients with the eval_numerical_gradient method provided by the course, I get the following relative errors for the gradients:
param_grad_num = {}
rel_error = {}
for param_name in grads:
    param_grad_num[param_name] = eval_numerical_gradient(lambda W: two_layer_net(X, model, y, reg)[0], model[param_name], verbose=False)
    rel_error[param_name] = relative_error(param_grad_num[param_name], grads[param_name])
{'W1': 0.010069468997284833,
'W2': 0.6628490408291472,
'b1': 1.9498867941113963e-09,
'b2': 1.7223972753120753e-11}
These are too large for W1 and W2; the relative error should be less than 1e-8. Can anybody explain this or help in any way?