I know that optimizers in TensorFlow split minimize into compute_gradients and apply_gradients. However, optimization algorithms like Adam generally process the gradients further, with momentum and other techniques, as the following figure suggests (thanks @kmario23 for providing the figure). I wonder at which point these techniques are applied to the gradients: in compute_gradients or in apply_gradients?
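For context, here is a minimal sketch of the two-step usage that minimize wraps in the TF 1.x API (the toy variable and loss are my own, just for illustration):

import tensorflow as tf

w = tf.Variable(1.0)
loss = tf.square(w)

opt = tf.train.AdamOptimizer()

# Step 1: get the raw (gradient, variable) pairs for the loss.
grads_and_vars = opt.compute_gradients(loss)

# (The pairs could be inspected or modified here, e.g. for gradient clipping.)

# Step 2: hand the pairs back to the optimizer, which performs its update rule.
train_op = opt.apply_gradients(grads_and_vars)

Calling opt.minimize(loss) is equivalent to chaining these two calls, which is why the question is about where Adam's moment processing happens between them.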
Update
sess = tf.Session()

# A tiny model: one dense layer, with MSE loss against a target of ones.
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)

opt = tf.train.AdamOptimizer()
# Only compute_gradients is used here; apply_gradients is never called.
grads = opt.compute_gradients(loss)

sess.run(tf.global_variables_initializer())

# Evaluate the gradients twice with the same input.
print(sess.run(grads, feed_dict={x: [[1]]}))
print(sess.run(grads, feed_dict={x: [[1]]}))
The above code prints the same result twice. Does that suggest the moment estimates are computed in apply_gradients? Because, IMHO, if the moment estimates were computed in compute_gradients, then after the first print statement the first and second moments would already have been updated, which should produce a different result in the second print statement.
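A minimal way to check this directly (my own sketch, continuing the snippet above) is to read Adam's moment estimates before and after running each op. In TF 1.x, AdamOptimizer stores them in slot variables named 'm' and 'v', which get_slot exposes:

# Build the update op as well; this also creates the 'm' and 'v' slots.
train_op = opt.apply_gradients(grads)
sess.run(tf.global_variables_initializer())

# Look at the moment estimates for the first trainable variable (the kernel).
var = tf.trainable_variables()[0]
m, v = opt.get_slot(var, 'm'), opt.get_slot(var, 'v')

print(sess.run([m, v]))                    # initial moments (zeros)
sess.run(grads, feed_dict={x: [[1]]})      # compute_gradients only
print(sess.run([m, v]))                    # moments after compute_gradients
sess.run(train_op, feed_dict={x: [[1]]})   # apply_gradients
print(sess.run([m, v]))                    # moments after apply_gradients

If the printed moments only change after running train_op, that would confirm that the moment updates happen inside apply_gradients, not in compute_gradients.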