Global Step for Differential Learning Rate

Question

Based on this question, I am trying to implement differential learning rates as follows:

var_list1 = [...]  # variables from the first 5 layers
var_list2 = [...]  # the rest of the variables

# Create two separate optimizers
opt1 = tf.train.AdamOptimizer(0.00001)
opt2 = tf.train.AdamOptimizer(0.0001)

# Compute gradients for each set of variables
grads1, variables1 = zip(*opt1.compute_gradients(loss, var_list1))
grads2, variables2 = zip(*opt2.compute_gradients(loss, var_list2))

# Apply Gradients
train_op1 = opt1.apply_gradients(zip(grads1, variables1))
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
train_op = tf.group(train_op1, train_op2)

I am unsure whether global_step should be passed to every apply_gradients call or to only one of them. My understanding is that when apply_gradients is called, global_step is incremented by 1 if it is supplied (code here). Based on this, I believe that I should pass global_step to only one of my apply_gradients() calls. Can anybody confirm that this is the correct approach?

The alternative to what I have above would be to do the following:

train_op1 = opt1.apply_gradients(zip(grads1, variables1), global_step=global_step)
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)

While technically each call to apply_gradients is a step, my understanding is that global_step should represent the number of mini-batches that have been completed. If I were to pass it to both apply_gradients() calls, the global step would increase twice per mini-batch. Based on this, I believe the more accurate implementation is the first one, where global_step is passed only once. Would others agree that this is the correct implementation? Does it matter which apply_gradients() call global_step is passed to?
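To make the counting argument concrete, here is a minimal pure-Python sketch of the bookkeeping only. It does not use TensorFlow: `ToyOptimizer`, `group`, and `run_batches` are hypothetical stand-ins that I made up for illustration, and the sketch assumes that each apply_gradients op that was given a global_step adds exactly 1 to it every time it runs.

```python
class ToyOptimizer:
    """Hypothetical stand-in for an optimizer; models only the step counter."""

    def apply_gradients(self, grads_and_vars, global_step=None):
        def train_op():
            # The real variable updates would happen here; they are omitted.
            # Assumption: a supplied global_step is incremented once per run.
            if global_step is not None:
                global_step["value"] += 1
        return train_op


def group(*ops):
    """Hypothetical stand-in for grouping ops: runs every op once."""
    def grouped():
        for op in ops:
            op()
    return grouped


def run_batches(pass_step_to_both, num_batches=100):
    """Return the counter value after training on num_batches mini-batches."""
    global_step = {"value": 0}  # mutable counter shared by both train ops
    opt1, opt2 = ToyOptimizer(), ToyOptimizer()
    step_for_opt1 = global_step if pass_step_to_both else None
    train_op1 = opt1.apply_gradients([], global_step=step_for_opt1)
    train_op2 = opt2.apply_gradients([], global_step=global_step)
    train_op = group(train_op1, train_op2)
    for _ in range(num_batches):
        train_op()  # one mini-batch
    return global_step["value"]


once = run_batches(pass_step_to_both=False)   # first implementation
twice = run_batches(pass_step_to_both=True)   # alternative implementation
print(once, twice)  # prints "100 200"
```

Under that assumption, only the first variant keeps the counter equal to the number of mini-batches. In this toy model it makes no difference to the count which of the two optimizers receives global_step.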

I think adding `global_step` to both should be fine as in [this example](https://stackoverflow.com/questions/47156113/how-to-alternate-train-ops-in-tensorflow). — Guillem Xercavins, Feb 22 '18 at 13:01
In that example, they alternate between the two optimizers, so global_step would only be incremented once on each step. — reese0106, Feb 22 '18 at 16:00
