
This is a possible duplicate of "Tensorflow: How to get gradients per instance in a batch?". I am asking anyway because that question has no satisfying answer and my goal here is a bit different.

I have a very large network that fits on my GPU, but the maximum batch size I can feed is 32. Anything larger causes the GPU to run out of memory. I want to use a larger batch to get a more accurate approximation of the gradient.

For concreteness, let's say I want to compute the gradient on a big batch of size 96 by feeding 3 batches of 32 in turn. The best way I know of is to use Optimizer.compute_gradients() and Optimizer.apply_gradients(). Here is a small example of how it could work:

import tensorflow as tf
import numpy as np

learn_rate = 0.1

W_init = np.array([ [1,2,3], [4,5,6], [7,8,9] ], dtype=np.float32)
x_init = np.array([ [11,12,13], [14,15,16], [17,18,19] ], dtype=np.float32)

X = tf.placeholder(dtype=np.float32, name="x")
W = tf.Variable(W_init, dtype=np.float32, name="w")
y = tf.matmul(X, W, name="y")
loss = tf.reduce_mean(y, name="loss")

opt = tf.train.GradientDescentOptimizer(learn_rate)
grad_vars_op = opt.compute_gradients(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Compute the gradients for each batch
grads_vars1 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,0]})
grads_vars2 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,1]})
grads_vars3 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,2]})

# Separate the gradients from the variables
grads1 = [ grad for grad, var in grads_vars1 ]
grads2 = [ grad for grad, var in grads_vars2 ]
grads3 = [ grad for grad, var in grads_vars3 ]
varl   = [ var  for grad, var in grads_vars1 ]

# Average the gradients
grads  = [ (g1 + g2 + g3)/3 for g1, g2, g3 in zip(grads1, grads2, grads3)]

sess.run(opt.apply_gradients(zip(grads,varl)))

print("Weights after 1 gradient")
print(sess.run(W))

This is all very ugly and inefficient: the forward pass runs on the GPU, averaging the gradients happens on the CPU, and applying them happens on the GPU again.

Moreover, this code throws an exception because grads is a list of np.arrays and to make it work, one would have to create a tf.placeholder for every gradient.

I am sure there must be a better and more efficient way to do this. Any suggestions?

Engineero
niko

1 Answer


You can create a copy of trainable_variables and accumulate the batch gradients in it. Here are a few simple steps to follow:

...
opt = tf.train.GradientDescentOptimizer(learn_rate)

# number of batches over which gradients are accumulated
n_batches = 3
# constant that scales each batch gradient so the accumulated sum is the average
const = tf.constant(1.0 / n_batches)
# get all trainable variables
t_vars = tf.trainable_variables()
# create a non-trainable accumulator for each trainable variable, with `0` as initial values
accum_tvars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in t_vars]
# create ops that reset all accumulators to zero
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_tvars]

# compute gradients for a batch
batch_grads_vars = opt.compute_gradients(loss, t_vars)
# add the batch gradient (scaled by const) to the accumulators
accum_ops = [accum_tvars[i].assign_add(tf.scalar_mul(const, batch_grad_var[0])) for i, batch_grad_var in enumerate(batch_grads_vars)]

# apply the accumulated gradients
train_step = opt.apply_gradients([(accum_tvars[i], batch_grad_var[1]) for i, batch_grad_var in enumerate(batch_grads_vars)])
# train_step = opt.apply_gradients(zip(accum_tvars, list(zip(*batch_grads_vars))[1]))

# the accumulators are new variables, so run the initializer after creating them
# (assumes `sess` was created in the elided setup above)
sess.run(tf.global_variables_initializer())

while True:
    # reset the accumulated grads
    sess.run(zero_ops)

    # feed one small batch at a time; each row of x_init is one batch here
    for i in range(n_batches):
        sess.run(accum_ops, feed_dict={X: x_init[None, i]})

    sess.run(train_step)
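
To see why scaling each batch gradient by 1/n_batches and summing gives the same update as one big batch, here is a minimal sketch in plain Python (no TensorFlow), using the toy loss from the question, mean(X @ W). The helper `grad_mean_matmul` and the names `accum`, `full`, `max_diff` and `W_new` are made up for this illustration; the gradient formula is derived by hand for this specific loss and is not a general-purpose routine.

```python
# Plain-Python sketch: accumulate scaled per-batch gradients, then apply once.
# Loss is mean(X @ W), as in the question; each "batch" is a single row of X.

def grad_mean_matmul(x_rows, n_cols):
    """Gradient of mean(X @ W) w.r.t. W.

    Entry (j, k) equals sum_i X[i][j] / (n_rows * n_cols); it does not depend on k.
    """
    n_rows = len(x_rows)
    n_in = len(x_rows[0])
    return [[sum(row[j] for row in x_rows) / (n_rows * n_cols)
             for _ in range(n_cols)]
            for j in range(n_in)]

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
X = [[11.0, 12.0, 13.0], [14.0, 15.0, 16.0], [17.0, 18.0, 19.0]]
learn_rate = 0.1
n_batches = 3

# Accumulator with the same shape as W, zeroed before each big batch.
accum = [[0.0] * 3 for _ in range(3)]
for i in range(n_batches):
    g = grad_mean_matmul([X[i]], n_cols=3)
    for j in range(3):
        for k in range(3):
            accum[j][k] += g[j][k] / n_batches  # scale, then accumulate

# Reference: gradient computed on the full batch in one go.
full = grad_mean_matmul(X, n_cols=3)
max_diff = max(abs(accum[j][k] - full[j][k]) for j in range(3) for k in range(3))

# One gradient-descent step with the accumulated gradient.
W_new = [[W[j][k] - learn_rate * accum[j][k] for k in range(3)] for j in range(3)]
print("max difference to full-batch gradient:", max_diff)
```

The equality holds here because the loss is a plain mean over examples and all batches have the same size; with unequal batch sizes the per-batch weights would have to be adjusted.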
rugrag
Ishant Mrinal
  • Nice solution. It would be slightly more pythonic to use zip in both the accum_ops and train_step list comprehensions instead of enumerate and indexing (and probably more readable too). – lejlot Aug 31 '17 at 20:27
  • nice solution indeed. Am I correct that all operations will be executed on the GPU? – niko Aug 31 '17 at 23:10
  • The `assign` ops run wherever your variables are defined (CPU or GPU). You can compute the rest on the GPU. – Ishant Mrinal Sep 01 '17 at 04:22
  • Nice solution! But it looks like there should be one more step for averaging the gradients. – Ruofan Kong Oct 02 '17 at 20:12
  • Two fairly critical problems: 1. This doesn't work in general: if you are using anything that acts over a batch (like BatchNorm) then it's not mathematically equivalent. 2. I wrote some code based on this idea and it doesn't seem to actually work, despite accurately replicating the gradients. https://gist.github.com/Multihuntr/b8cb68316842ff68cab3062740a2a730 I don't think I've made any logic errors. – Multihunter Nov 08 '17 at 02:21