
Is it possible to set different learning rates for different variables in the same layer in TensorFlow?

For example, in a dense layer, how can you set a learning rate of 0.001 for the kernel while setting the learning rate for the bias to be 0.005?

One solution is to divide the layer into two layers. In one layer you train only the kernel (with a non-trainable zero bias), and in the other you train only the bias (with a non-trainable identity kernel). This way you can use tfa.optimizers.MultiOptimizer to set different learning rates for the two layers. But this slightly slows down the training, because the training of the bias and the kernel is no longer parallelised. So, I'm wondering if there is a standard way of setting different learning rates for different variables in the same layer in TF?
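For reference, the split described above can be sketched roughly as follows. The `BiasOnly` layer name is made up here, and the `MultiOptimizer` pairing at the end is shown commented out since it requires `tensorflow_addons`:

```python
import tensorflow as tf

class BiasOnly(tf.keras.layers.Layer):
    """Adds only a trainable bias; the implicit kernel is the identity."""
    def build(self, input_shape):
        self.bias = self.add_weight(
            name="bias", shape=(input_shape[-1],), initializer="zeros"
        )

    def call(self, inputs):
        return inputs + self.bias

# Split one Dense layer into a kernel-only layer and a bias-only layer.
kernel_layer = tf.keras.layers.Dense(16, use_bias=False)  # trains only the kernel
bias_layer = BiasOnly()                                   # trains only the bias

model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), kernel_layer, bias_layer])

# With tensorflow_addons installed, each layer could then get its own
# optimizer (and learning rate):
# import tensorflow_addons as tfa
# optimizer = tfa.optimizers.MultiOptimizer([
#     (tf.keras.optimizers.SGD(0.001), kernel_layer),
#     (tf.keras.optimizers.SGD(0.005), bias_layer),
# ])
```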

mehini
  • What you described ("train the kernel (with a non-trainable 0 bias) and in the other one you only train the bias") is to decouple the weights of a given layer. From the docs of `tfa.optimizers.MultiOptimizer` it seems like "Each optimizer will optimize only the weights associated with its paired layer." So, it can treat different layers (not weights of a given layer) independently from each other. – learner Feb 17 '23 at 22:23

1 Answer

0

This should be possible using custom training loops and multiple optimizers.

First, instantiate a different optimizer for each set of variables (assuming a custom layer with 3 distinct sets of variables for which we want different learning rates):

optim_A = tf.keras.optimizers.SGD(learning_rate=0.1)
optim_B = tf.keras.optimizers.SGD(learning_rate=0.01)
optim_C = tf.keras.optimizers.SGD(learning_rate=0.001)

Then create the custom training loop:

epochs = 10
for epoch in range(epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            logits = model(x_batch_train, training=True)

            loss_value = loss_fn(y_batch_train, logits)

        grads = tape.gradient(loss_value, model.trainable_variables)

        # Apply each gradient with its own optimizer, so each variable
        # is updated with its own learning rate.
        for optimizer, grad, var in zip(
            [optim_A, optim_B, optim_C], grads, model.trainable_variables
        ):
            optimizer.apply_gradients([(grad, var)])

A more detailed guide on custom training loops can be found here. The idea with the different optimizers originates from here.
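To tie this back to the original question, here is a self-contained sketch of the same idea for a single Dense layer, with a learning rate of 0.001 for the kernel and 0.005 for the bias. The data and model shapes here are arbitrary, chosen just to make the example runnable:

```python
import numpy as np
import tensorflow as tf

# A single Dense layer: trainable_variables is [kernel, bias], in that order.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()

# One optimizer per variable, matching the question's learning rates.
optim_kernel = tf.keras.optimizers.SGD(learning_rate=0.001)
optim_bias = tf.keras.optimizers.SGD(learning_rate=0.005)

x = np.random.rand(8, 4).astype("float32")
y = np.random.rand(8, 1).astype("float32")

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, model.trainable_variables)

# Each optimizer only ever sees its own (gradient, variable) pair, so the
# kernel and bias are updated with different learning rates.
for opt, grad, var in zip(
    [optim_kernel, optim_bias], grads, model.trainable_variables
):
    opt.apply_gradients([(grad, var)])
```

Both updates still happen inside the same training step, so this avoids the serialization cost of splitting the layer in two.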