
When I added an ExponentialDecay learning rate schedule to my Adam optimizer, the training behavior changed even before the schedule should have become effective. I used the following definition for the schedule:

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)

Since I'm using staircase=True, there should be no difference from a static learning rate of the same value for the first 25 epochs. So the following two optimizers should yield identical training results during that period:

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
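To make that expectation concrete, the schedule can be queried directly, since an ExponentialDecay instance is callable with a step index (the step values below are purely illustrative):

# Sketch: evaluate the schedule at a few step indices.
for step in [0, 10, 24, 25, 50]:
    print(step, float(lr_schedule(step)))
# With staircase=True this stays at 1e-3 for steps 0-24
# and first drops to 9.5e-4 at step 25.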

However, I observed that the behavior already differs before that point:

[Example loss curve: static learning rate vs. learning rate schedule]

This is the test code I used:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

np.random.seed(0)

# Toy data: quadratic target with Gaussian noise
x_data = 2*np.random.random(size=(1000, 1))
y_data = np.random.normal(loc=x_data**2, scale=0.05)

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)

histories = []
learning_rates = [1e-3, lr_schedule]
# Train the same model twice: once with the static learning rate,
# once with the schedule, using identical random seeds.
for lr in learning_rates:
    tf.random.set_seed(0)

    model = tf.keras.models.Sequential([
        Dense(10, activation='tanh', input_dim=1), Dropout(0.2),
        Dense(10, activation='tanh'), Dropout(0.2),
        Dense(1)
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='mse')
    history = model.fit(x_data, y_data, epochs=50)
    histories.append(history.history['loss'])

fig, ax = plt.subplots()
ax.set(xlabel='Epoch', ylabel='Loss')
ax.plot(histories[0], label='Static learning rate')
ax.plot(histories[1], label='Learning rate schedule')
ax.legend()
plt.show()

I'm using Python 3.7.9 and the following install of TensorFlow:

$ conda list | grep tensorflow
tensorflow                2.1.0           mkl_py37h80a91df_0  
tensorflow-base           2.1.0           mkl_py37h6d63fb7_0  
tensorflow-estimator      2.1.0              pyhd54b08b_0

2 Answers


When you use ExponentialDecay, you are essentially defining a decayed learning rate of the form:

def decayed_learning_rate(step):
  return initial_learning_rate * decay_rate ** (step / decay_steps)

When you set staircase=True, step / decay_steps becomes a floor division, so the rate follows a staircase function. Now, let's take a look at the source code:

# ...setup for step function...
global_step_recomp = math_ops.cast(step, dtype) # step is the current step count
p = global_step_recomp / decay_steps
if self.staircase:
  p = math_ops.floor(p)
return math_ops.multiply(initial_learning_rate, math_ops.pow(decay_rate, p), name=name)

We can see that p increases by one at every multiple of decay_steps, i.e. at step 25, 50, 75 and so on. In other words, the learning rate is constant for every 25 steps, not epochs, which is why it already changes during the first 25 epochs. A good explanation of the difference can be found in What is the difference between steps and epochs in TensorFlow?
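A minimal sketch illustrating this (it assumes Keras's default batch_size of 32 for model.fit; the model and data are just placeholders mirroring the question's 1000-sample dataset):

import numpy as np
import tensorflow as tf

x = np.random.random((1000, 1)).astype('float32')
y = np.random.random((1000, 1)).astype('float32')

model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
model.fit(x, y, epochs=1, verbose=0)

# After a single epoch the optimizer has already taken one step per batch:
# ceil(1000 / 32) = 32 steps with the default batch_size of 32, so a schedule
# with decay_steps=25 would already have decayed during epoch 1.
print(model.optimizer.iterations.numpy())  # 32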


The decay_steps parameter in ExponentialDecay does not mean the number of epochs but the number of steps, where one step is a training pass over a single batch. If you want the learning rate to start decaying at the 25th epoch, this parameter should be 25 * (num_samples_of_whole_dataset / batch_size).
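Applied to the question's setup, a sketch of that conversion might look like this (the batch size of 32 is Keras's default for model.fit and is an assumption here):

import math
import tensorflow as tf

num_samples = 1000      # size of the dataset in the question
batch_size = 32         # Keras default when none is passed to fit()
steps_per_epoch = math.ceil(num_samples / batch_size)  # 32

# Decay every 25 epochs instead of every 25 optimizer steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3,
    decay_steps=25 * steps_per_epoch,  # 800 steps, roughly 25 epochs
    decay_rate=0.95,
    staircase=True)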
