
I designed a fully connected MLP with two hidden layers and one output layer. I get a nice learning curve if I use batch or mini-batch gradient descent.

But I get a flat, straight line when performing stochastic gradient descent (the violet curve in the learning-curve plot).

What did I get wrong?

In my understanding, I'm doing stochastic gradient descent with TensorFlow if I provide just one training example per train step, like:

import tensorflow as tf

# Placeholders with a dynamic batch dimension (None)
X = tf.placeholder("float", [None, amountInput], name="Input")
Y = tf.placeholder("float", [None, amountOutput], name="TeachingInput")
...
# Feed a single example and label per train step
m, i = sess.run([merged, train_op], feed_dict={X: [input], Y: [label]})

Here, input is a 10-component vector and label is a 20-component vector.

For testing I run 1000 iterations; each iteration feeds one of 50 prepared training examples. I expected an overfitted network. But as you can see, it doesn't learn :(
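Concretely, the loop looks roughly like this (the examples list and the cycling scheme are placeholders for my 50 prepared pairs, not my exact code):

# Cycle through the 50 prepared examples, one per train step
for step in range(1000):
    input, label = examples[step % 50]
    m, i = sess.run([merged, train_op], feed_dict={X: [input], Y: [label]})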

Because the network will run in an online-learning environment, mini-batch or batch gradient descent isn't an option.

Thanks for any hints.

hallo02

1 Answer


The batch size influences the effective learning rate.

If you look at the update formula for a single parameter, you'll see that it's updated by averaging the gradient values computed for that parameter over every element in the input batch.
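In symbols, assuming the loss is averaged over the batch (the usual mean-reduction setup), the update of a parameter θ with learning rate η and batch size n is approximately:

θ ← θ − (η / n) · Σ_{i=1}^{n} ∇_θ L(x_i, y_i; θ)

so each individual example's gradient enters the update with weight η/n.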

This means that if you're working with a batch of size n, the "real" learning rate applied to each example's gradient is about learning_rate/n.

Thus, if the model you trained with batches of size n trained without issues, that's because the learning rate was appropriate for that batch size.

If you use pure stochastic gradient descent, you have to lower the learning rate (usually by a factor of some power of 10).

So, for example, if your learning rate was 1e-4 with a batch size of 128, try a learning rate of 1e-4 / 128.0 and see if the network learns (it should).
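A minimal sketch of the rescaling (the loss tensor is assumed to exist; TF 1.x API):

batch_size = 128
batch_lr = 1e-4                 # learning rate that worked with mini-batches
sgd_lr = batch_lr / batch_size  # ~7.8e-7 for one-example-per-step updates

# Plain gradient descent with the rescaled learning rate
train_op = tf.train.GradientDescentOptimizer(learning_rate=sgd_lr).minimize(loss)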

nessuno
  • I'm confused: where does the "averaging" of values happen? I've tested this with the function `y = b0 + b1 * x`. I can output the gradient sum w.r.t. each parameter. On each step, delta(b0) = learning_rate * grad_sum(b0). The more values in a batch, the higher the sum. That is, in batch mode I get a HIGHER parameter change per step. In fact, feeding a batch in one step changes the parameter by the sum of the changes I get from splitting the batch and feeding one sample per step. I can share my outputs if needed. Please help me understand this. – noname7619 Feb 05 '17 at 09:32