
When I added an ExponentialDecay learning rate schedule to my Adam optimizer, the training behavior changed even before the schedule should have become effective. I used the following definition for the schedule:

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)

Since I'm using staircase=True, there should be no difference from a static learning rate of the same value for the first 25 epochs. So the following two optimizers should yield identical training results during that period:

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
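To make that expectation concrete, the schedule can be queried directly, since an ExponentialDecay instance is callable with a step index (the step values below are purely illustrative):

# Sketch: evaluate the schedule at a few step indices.
for step in [0, 10, 24, 25, 50]:
    print(step, float(lr_schedule(step)))
# With staircase=True this stays at 1e-3 for steps 0-24
# and first drops to 9.5e-4 at step 25.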

However, I observed that the behavior already differs before that point:

[Example loss curve: static learning rate vs. learning rate schedule]

This is the test code I used:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

np.random.seed(0)

# Toy data: quadratic target with Gaussian noise
x_data = 2*np.random.random(size=(1000, 1))
y_data = np.random.normal(loc=x_data**2, scale=0.05)

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)

histories = []
learning_rates = [1e-3, lr_schedule]
# Train the same model twice: once with the static learning rate,
# once with the schedule, using identical random seeds.
for lr in learning_rates:
    tf.random.set_seed(0)

    model = tf.keras.models.Sequential([
        Dense(10, activation='tanh', input_dim=1), Dropout(0.2),
        Dense(10, activation='tanh'), Dropout(0.2),
        Dense(1)
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='mse')
    history = model.fit(x_data, y_data, epochs=50)
    histories.append(history.history['loss'])

fig, ax = plt.subplots()
ax.set(xlabel='Epoch', ylabel='Loss')
ax.plot(histories[0], label='Static learning rate')
ax.plot(histories[1], label='Learning rate schedule')
ax.legend()
plt.show()

I'm using Python 3.7.9 and the following install of TensorFlow:

$ conda list | grep tensorflow
tensorflow                2.1.0           mkl_py37h80a91df_0  
tensorflow-base           2.1.0           mkl_py37h6d63fb7_0  
tensorflow-estimator      2.1.0              pyhd54b08b_0

2 Answers


When you use ExponentialDecay, you are essentially defining a decayed learning rate of the form:

def decayed_learning_rate(step):
  return initial_learning_rate * decay_rate ** (step / decay_steps)

When you set staircase=True, step / decay_steps becomes a floor division, so the rate follows a staircase function. Now, let's take a look at the source code:

# ...setup for step function...
global_step_recomp = math_ops.cast(step, dtype) # step is the current step count
p = global_step_recomp / decay_steps
if self.staircase:
  p = math_ops.floor(p)
return math_ops.multiply(initial_learning_rate, math_ops.pow(decay_rate, p), name=name)

We can see that p increases by one at every multiple of decay_steps, i.e. at step 25, 50, 75 and so on. In other words, the learning rate is constant for every 25 steps, not epochs, which is why it already changes during the first 25 epochs. A good explanation of the difference can be found in What is the difference between steps and epochs in TensorFlow?
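A minimal sketch illustrating this (it assumes Keras's default batch_size of 32 for model.fit; the model and data are just placeholders mirroring the question's 1000-sample dataset):

import numpy as np
import tensorflow as tf

x = np.random.random((1000, 1)).astype('float32')
y = np.random.random((1000, 1)).astype('float32')

model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
model.fit(x, y, epochs=1, verbose=0)

# After a single epoch the optimizer has already taken one step per batch:
# ceil(1000 / 32) = 32 steps with the default batch_size of 32, so a schedule
# with decay_steps=25 would already have decayed during epoch 1.
print(model.optimizer.iterations.numpy())  # 32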


The decay_steps parameter in ExponentialDecay does not mean the number of epochs but the number of steps, where one step is a training pass over a single batch. If you want the learning rate to start decaying at the 25th epoch, this parameter should be 25 * (num_samples_of_whole_dataset / batch_size).
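Applied to the question's setup, a sketch of that conversion might look like this (the batch size of 32 is Keras's default for model.fit and is an assumption here):

import math
import tensorflow as tf

num_samples = 1000      # size of the dataset in the question
batch_size = 32         # Keras default when none is passed to fit()
steps_per_epoch = math.ceil(num_samples / batch_size)  # 32

# Decay every 25 epochs instead of every 25 optimizer steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3,
    decay_steps=25 * steps_per_epoch,  # 800 steps, roughly 25 epochs
    decay_rate=0.95,
    staircase=True)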
