I know that optimizers in TensorFlow split minimize into compute_gradients and apply_gradients. However, optimization algorithms like Adam generally process the gradients further, with momentum and other techniques, as the following figure suggests (thanks @kmario23 for providing the figure). I wonder at which point these techniques are applied to the gradients: in compute_gradients or in apply_gradients?
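For context, here is a minimal sketch of the two-step usage that minimize wraps in the TF 1.x API (the toy variable and loss are my own, just for illustration):

import tensorflow as tf

w = tf.Variable(1.0)
loss = tf.square(w)

opt = tf.train.AdamOptimizer()

# Step 1: get the raw (gradient, variable) pairs for the loss.
grads_and_vars = opt.compute_gradients(loss)

# (The pairs could be inspected or modified here, e.g. for gradient clipping.)

# Step 2: hand the pairs back to the optimizer, which performs its update rule.
train_op = opt.apply_gradients(grads_and_vars)

Calling opt.minimize(loss) is equivalent to chaining these two calls, which is why the question is about where Adam's moment processing happens between them.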
Update
sess = tf.Session()

# A tiny model: one dense layer, with MSE loss against a target of ones.
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)

opt = tf.train.AdamOptimizer()
# Only compute_gradients is used here; apply_gradients is never called.
grads = opt.compute_gradients(loss)

sess.run(tf.global_variables_initializer())

# Evaluate the gradients twice with the same input.
print(sess.run(grads, feed_dict={x: [[1]]}))
print(sess.run(grads, feed_dict={x: [[1]]}))
The above code prints the same result twice. Does that suggest the moment estimates are computed in apply_gradients? Because, IMHO, if the moment estimates were computed in compute_gradients, then after the first print statement the first and second moments would already have been updated, which should produce a different result in the second print statement.
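A minimal way to check this directly (my own sketch, continuing the snippet above) is to read Adam's moment estimates before and after running each op. In TF 1.x, AdamOptimizer stores them in slot variables named 'm' and 'v', which get_slot exposes:

# Build the update op as well; this also creates the 'm' and 'v' slots.
train_op = opt.apply_gradients(grads)
sess.run(tf.global_variables_initializer())

# Look at the moment estimates for the first trainable variable (the kernel).
var = tf.trainable_variables()[0]
m, v = opt.get_slot(var, 'm'), opt.get_slot(var, 'v')

print(sess.run([m, v]))                    # initial moments (zeros)
sess.run(grads, feed_dict={x: [[1]]})      # compute_gradients only
print(sess.run([m, v]))                    # moments after compute_gradients
sess.run(train_op, feed_dict={x: [[1]]})   # apply_gradients
print(sess.run([m, v]))                    # moments after apply_gradients

If the printed moments only change after running train_op, that would confirm that the moment updates happen inside apply_gradients, not in compute_gradients.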