tf.gradients() sums over ys, does it?

Question

https://www.tensorflow.org/versions/r1.6/api_docs/python/tf/gradients

In the documentation for tf.gradients(ys, xs) it states that

Constructs symbolic derivatives of sum of ys w.r.t. x in xs

I am confused about the summing part. I have read elsewhere that this sums the derivatives dy/dx across the batch for every x in the batch, but whenever I use it I fail to see this happening. Take the following simple example:

import numpy as np
import tensorflow as tf

x_dims = 3
batch_size = 4

x = tf.placeholder(tf.float32, (None, x_dims))

y = 2*(x**2)

grads = tf.gradients(y,x)

sess = tf.Session()

x_val = np.random.randint(0, 10, (batch_size, x_dims))
y_val, grads_val = sess.run([y, grads], {x:x_val})

print('x = \n', x_val)
print('y = \n', y_val)
print('dy/dx = \n', grads_val[0])

This gives the following output:

x = 
 [[5 3 7]
 [2 2 5]
 [7 5 0]
 [3 7 6]]
y = 
 [[50. 18. 98.]
 [ 8.  8. 50.]
 [98. 50.  0.]
 [18. 98. 72.]]
dy/dx = 
 [[20. 12. 28.]
 [ 8.  8. 20.]
 [28. 20.  0.]
 [12. 28. 24.]]

This is the output I would expect: simply the derivative dy/dx for every element in the batch. I don't see any summing happening. In other examples I have seen this operation followed by a division by the batch size, to account for tf.gradients() summing the gradients over the batch (see here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html). Why is this necessary?

I am using TensorFlow 1.6 and Python 3.

To give more insight into why the gradients are divided by the batch size in methods like DDPG: those gradients do not come from differentiating a loss function that has already accounted for the batch (for example with `tf.reduce_mean(...)`). The gradients have been summed by `tf.gradients`, so dividing by the batch size gives a mean gradient for the batch when applied with `apply_gradients`. - parrowdice, Jan 28 '20 at 00:01
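The point of this comment can be illustrated with a minimal NumPy sketch. This is not DDPG itself: the toy model `y_i = w * x_i`, the values, and all variable names are assumptions chosen for illustration. When a single parameter `w` is shared by every example in the batch, differentiating the summed outputs with respect to `w` accumulates one term per example, so dividing by the batch size turns the sum into a mean.

```python
import numpy as np

# Hypothetical toy model: one shared weight w applied to every batch element.
x = np.array([1.0, 2.0, 3.0, 4.0])   # batch of 4 scalar inputs
w = 0.5
batch_size = x.shape[0]

# y_i = w * x_i, so d(y_i)/dw = x_i for each example.
per_example_grads = x.copy()

# Differentiating sum(y) w.r.t. the shared w adds up every per-example term.
summed_grad = per_example_grads.sum()

# Dividing by the batch size recovers the mean per-example gradient.
mean_grad = summed_grad / batch_size

# Sanity check with a central finite difference on sum(y) as a function of w.
def total(w_value):
    return np.sum(w_value * x)

eps = 1e-6
fd_grad = (total(w + eps) - total(w - eps)) / (2 * eps)

print('summed d(sum y)/dw =', summed_grad)
print('finite difference  =', fd_grad)
print('mean gradient      =', mean_grad)
```

In the question's example there is no shared parameter: each x element only affects its own y element, which is why no summing is visible there.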

score 1 · Accepted Answer · answered Aug 15 '18 at 16:03

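tf.gradients(ys, xs) differentiates the sum of ys with respect to each x. If y is an elementwise function of x with the same shape, as in your example, each x affects exactly one y, so the sum over dy/dx is the sum over exactly one value. However, if each x contributes to more than one y, then the gradients are summed.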
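One way to read the "sum of ys" wording is in terms of the Jacobian: the gradient of the summed outputs is the sum of the Jacobian's rows. The NumPy sketch below builds the Jacobians by hand for a single row of the question's batch; it does not call TensorFlow, and the names `J_y` and `J_z` are illustrative. Here `z` is assumed to be two stacked copies of `y`.

```python
import numpy as np

x = np.array([5.0, 3.0, 7.0])          # one row from the question's batch

# y = 2 * x**2 elementwise: y_i depends only on x_i, so the Jacobian is diagonal.
J_y = np.diag(4.0 * x)                 # J_y[i, j] = d y_i / d x_j

# z stacks two copies of y, so its Jacobian stacks two copies of J_y.
J_z = np.vstack([J_y, J_y])            # shape (6, 3)

# Gradient of sum(outputs) w.r.t. x = ones @ Jacobian (a sum over output rows).
grad_sum_y = np.ones(J_y.shape[0]) @ J_y
grad_sum_z = np.ones(J_z.shape[0]) @ J_z

print('d(sum y)/dx =', grad_sum_y)     # equals 4 * x
print('d(sum z)/dx =', grad_sum_z)     # equals 8 * x
```

With a diagonal Jacobian each column has a single nonzero entry, so the row sum changes nothing; with the stacked Jacobian each column has two, and the result doubles.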

import numpy as np
import tensorflow as tf

x_dims = 3
batch_size = 4

x = tf.placeholder(tf.float32, (None, x_dims))
y = 2*(x**2)
z = tf.stack([y, y]) # There are twice as many z's as x's

dy_dx = tf.gradients(y,x)
dz_dx = tf.gradients(z,x)

sess = tf.Session()

x_val = np.random.randint(0, 10, (batch_size, x_dims))
y_val, z_val, dy_dx_val, dz_dx_val = sess.run([y, z, dy_dx, dz_dx], {x:x_val})

print('x.shape =', x_val.shape)
print('x = \n', x_val)
print('y.shape = ', y_val.shape)
print('y = \n', y_val)
print('z.shape = ', z_val.shape)
print('z = \n', z_val)
print('dy/dx = \n', dy_dx_val[0])
print('dz/dx = \n', dz_dx_val[0])

This produces the following output:

x.shape = (4, 3)
x = 
 [[1 4 8]
 [0 2 8]
 [2 8 1]
 [4 5 2]]

y.shape =  (4, 3)
y = 
 [[  2.  32. 128.]
 [  0.   8. 128.]
 [  8. 128.   2.]
 [ 32.  50.   8.]]

z.shape =  (2, 4, 3)
z = 
 [[[  2.  32. 128.]
  [  0.   8. 128.]
  [  8. 128.   2.]
  [ 32.  50.   8.]]

 [[  2.  32. 128.]
  [  0.   8. 128.]
  [  8. 128.   2.]
  [ 32.  50.   8.]]]

dy/dx = 
 [[ 4. 16. 32.]
 [ 0.  8. 32.]
 [ 8. 32.  4.]
 [16. 20.  8.]]
dz/dx = 
 [[ 8. 32. 64.]
 [ 0. 16. 64.]
 [16. 64.  8.]
 [32. 40. 16.]]

In particular, notice that the values of dz/dx are twice those of dy/dx, because the gradients are summed over the two copies of y in the stack.
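The factor of two can also be checked without TensorFlow. The sketch below is a NumPy-only central-difference estimate of d(sum(f(x)))/dx, using the same x values as the output above; the helper `numerical_grad_of_sum` is a hypothetical name written for this illustration, not a library function.

```python
import numpy as np

def numerical_grad_of_sum(f, x, eps=1e-3):
    """Central-difference estimate of d(sum(f(x)))/dx, one entry of x at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (np.sum(f(x + step)) - np.sum(f(x - step))) / (2 * eps)
    return grad

# Same x values as in the output above, as floats.
x = np.array([[1, 4, 8], [0, 2, 8], [2, 8, 1], [4, 5, 2]], dtype=np.float64)

def y_fn(v):
    return 2 * v**2

def z_fn(v):
    # Two stacked copies of y, mirroring tf.stack([y, y]).
    return np.stack([y_fn(v), y_fn(v)])

dy_dx = numerical_grad_of_sum(y_fn, x)
dz_dx = numerical_grad_of_sum(z_fn, x)

print('dy/dx = \n', dy_dx)
print('dz/dx = \n', dz_dx)
```

Up to floating-point error, the two estimates should match 4 * x and 8 * x respectively, which agrees with the dy/dx and dz/dx tables printed by the TensorFlow code.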
