I read Michael Nielsen's book neuralnetworksanddeeplearning.com about neural networks. He always uses the MNIST data for his examples. I took his code and designed exactly the same network in TensorFlow, but I realized that the results in TensorFlow are not the same (they are much worse).
Here are the details:
1) Michael Nielsen's code can be found at https://github.com/kanban1992/MNIST_Comparison/tree/master/Michael_Nielsen. You can start everything with
python start_2.py
The network has
- 3 hidden layers with 30 neurons each.
- All activation functions are sigmoids.
- I use stochastic gradient descent (learning rate 3.0) with backpropagation. The batch size is 10.
- A quadratic cost function without any regularization is used.
- The weight matrix connecting layer l to layer l+1 is initialized from a Gaussian distribution with mean 0.0 and stddev = 1/sqrt(number of neurons in layer l); see the sketch after this list. The biases are initialized from a standard normal distribution.
- After training for 5 epochs, 95% of the images in the validation set are classified correctly.
This approach has to be correct, because it works well and I did not modify it!
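For concreteness, here is a minimal NumPy sketch of that initialization scheme, assuming layer sizes [784, 30, 30, 30, 10] and the (inputs, outputs) weight-matrix convention of my TensorFlow snippet below (the variable names are mine, not Nielsen's):

import numpy as np

sizes = [784, 30, 30, 30, 10]
# weights connecting layer l (n_in neurons) to layer l+1 (n_out neurons):
# Gaussian with mean 0.0 and stddev 1/sqrt(n_in)
weights = [np.random.normal(0.0, 1.0/np.sqrt(n_in), (n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
# biases: standard normal, one per neuron in layers 2 to 5
biases = [np.random.randn(n_out) for n_out in sizes[1:]]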
2) I wrote the TensorFlow implementation myself; it has exactly the same structure as the Nielsen net described in point 1). The full code can be found at https://github.com/kanban1992/MNIST_Comparison/tree/master/tensorflow and run with
python start_train.py
With the TensorFlow approach I get an accuracy of 10% (which is the same as random guessing!). So something is not working, and I have no idea what.
Here is a snippet of the most important part of the code:
import math
import numpy as np
import tensorflow as tf
import mnist_loader

x_training, y_training, x_validation, y_validation, x_test, y_test = mnist_loader.load_data_wrapper()
N_training=len(x_training)
N_validation=len(x_validation)
N_test=len(x_test)
N_epochs = 5
learning_rate = 3.0
batch_size = 10
N1 = 784 #equals N_inputs
N2 = 30
N3 = 30
N4 = 30
N5 = 10
N_in=N1
N_out=N5
x = tf.placeholder(tf.float32, [None, N1]) # don't pass shape=(batch_size, N1): the placeholder must work for different batch sizes
W2 = tf.Variable(tf.random_normal([N1, N2], mean=0.0, stddev=1.0/math.sqrt(N1*1.0))) # initialize each neuron's weights with stddev 1/sqrt(number of inputs to the neuron, i.e. neurons in the previous layer)
b2 = tf.Variable(tf.random_normal([N2]))
a2 = tf.sigmoid(tf.matmul(x, W2) + b2) #x=a1
W3 = tf.Variable(tf.random_normal([N2, N3],mean=0.0,stddev=1.0/math.sqrt(N2*1.0)))
b3 = tf.Variable(tf.random_normal([N3]))
a3 = tf.sigmoid(tf.matmul(a2, W3) + b3)
W4 = tf.Variable(tf.random_normal([N3, N4],mean=0.0,stddev=1.0/math.sqrt(N3*1.0)))
b4 = tf.Variable(tf.random_normal([N4]))
a4 = tf.sigmoid(tf.matmul(a3, W4) + b4)
W5 = tf.Variable(tf.random_normal([N4, N5],mean=0.0,stddev=1.0/math.sqrt(N4*1.0)))
b5 = tf.Variable(tf.random_normal([N5]))
y = tf.sigmoid(tf.matmul(a4, W5) + b5)
y_ = tf.placeholder(tf.float32, [None, N_out]) # again without a fixed shape=(batch_size, N_out)
quadratic_cost= tf.scalar_mul(1.0/(N_training*2.0),tf.reduce_sum(tf.squared_difference(y,y_)))
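# i.e. C = 1/(2*N_training) * sum over the fed examples of ||y - y_||^2;
# the prefactor always uses the full training-set size N_training,
# independent of how many examples the feed actually contains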
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(quadratic_cost)
init = tf.initialize_all_variables()
#launch the graph
sess = tf.Session()
sess.run(init)
# number of mini-batches per epoch
N_training_batch = N_training/batch_size # integer division in Python 2, rounds down
correct=[0]*N_epochs
cost_training_data=[0.0]*N_epochs
for i in range(0, N_epochs):
    for j in range(0, N_training_batch):
        start = j*batch_size
        end = (j+1)*batch_size
        batch_x = x_training[start:end]
        batch_y = y_training[start:end]
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})

    # reshuffle the training data after each epoch
    perm = np.arange(N_training)
    np.random.shuffle(perm)
    x_training = x_training[perm]
    y_training = y_training[perm]

    # cost on the full training data after each epoch
    cost_training_data[i] = sess.run(quadratic_cost, feed_dict={x: x_training, y_: y_training})

    # correct predictions on the validation set after each epoch
    y_out_validation = sess.run(y, feed_dict={x: x_validation})
    for k in range(0, len(y_out_validation)):
        arg = np.argmax(y_out_validation[k])
        if 1.0 == y_validation[k][arg]:
            correct[i] += 1

    print "correct after "+str(i)+" epochs: "+str(correct[i])
It would be really great if you could tell me what's going wrong :-)