Based on this question, I am trying to implement differential learning rates as follows:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
#Create Two Separate Optimizers
opt1 = tf.train.AdamOptimizer(0.00001)
opt2 = tf.train.AdamOptimizer(0.0001)
# Compute Gradients for eacch set of variables
grads1, variables1 = zip(*opt1.compute_gradients(loss, var_list1))
grads2, variables2 = zip(*opt2.compute_gradients(loss, var_list2))
# Apply Gradients
train_op1 = opt1.apply_gradients(zip(grads1, variables1))
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
train_op = tf.group(train_op1, train_op2)
I am unsure if global_step
should be included in each apply_gradients
call or if it should only be included in 1? My understanding is that when apply_gradients is called, global_step
is incremented by 1 if it is supplied (code here). Based on this, I believe that I should only include global_step
in one of my apply_gradients()
calls. Can anybody confirm that this is the correct approach?
The alternative to what I have above would be to do the following:
train_op1 = opt1.apply_gradients(zip(grads1, variables1), global_step=global_step)
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
While technically each call to apply_gradients
is a step, my understanding is that global_step should represent the number of mini-batches that have been completed so if I were to reference it in both apply_gradients()
calls then the global step would increase twice per mini-batch. So, based onthis I believe the more accurate implementation would be the first implementation where it is called once. Would others agree this is the correct implementation? Does it matter which apply_gradients()
the global_step is included in?