My model is implemented in TensorFlow, and since it might take a long time to train, I am trying to implement a way to resume training after an interruption. However, I haven't been able to get the same results when training is resumed as when it runs in a single, uninterrupted pass. Both the seed and tf.config.experimental.enable_op_determinism are set and working as intended: I get identical results from two complete runs. But the results diverge at the very first epoch after resuming an interrupted run.
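For context, the determinism setup looks roughly like this (a minimal sketch, assuming the seed is set through tf.keras.utils.set_random_seed; the value 42 is arbitrary):

import tensorflow as tf

# Seed the Python, NumPy and TensorFlow RNGs in one call, then force
# deterministic op implementations (this may slow training down).
tf.keras.utils.set_random_seed(42)
tf.config.experimental.enable_op_determinism()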
Saving/reloading checkpoint weights works fine, so that shouldn't be the issue. My data is wrapped in a tf.data.Dataset object with shuffling enabled, and when training is resumed I make sure the epochs and batches consumed before the interruption are produced and then discarded; I double (triple!) checked that the data actually matches the desired behavior.
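Roughly, that fast-forwarding of the data looks like this (a simplified sketch; make_dataset, seed, batch_size and starting_epoch are placeholders standing in for my actual pipeline):

# Simplified sketch of discarding the already-trained epochs/batches on resume,
# so the shuffle RNG advances exactly as it would in a single uninterrupted run.
# `make_dataset`, `seed`, `batch_size` and `starting_epoch` are placeholders.
dataset = (make_dataset()
           .shuffle(buffer_size=10_000, seed=seed, reshuffle_each_iteration=True)
           .batch(batch_size))

for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        pass  # consume and discard, without training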
My remaining suspicion concerns the optimizer, since it uses a schedule for the learning rate. I defined a custom schedule:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model: int, warmup_steps: int):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):  # <- depends on step, which needs aligning
        step = tf.cast(step, tf.float32)  # step may arrive as an integer tensor
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
and then I pass it to an instance of Adam:
optimizer = tf.keras.optimizers.Adam(
    CustomSchedule(d_model, warmup_steps),
    beta_1=beta_1, beta_2=beta_2, epsilon=epsilon,
)
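As a quick sanity check (outside the training loop), the schedule can be evaluated directly at a few global steps; the values should match between a fresh run and a resumed run at the same step:

# Evaluate the schedule at a few global steps; these values should be identical
# in a single uninterrupted run and in a resumed run at the same step.
schedule = CustomSchedule(d_model, warmup_steps)
for step in (1.0, 100.0, float(warmup_steps), 10_000.0):
    print(int(step), float(schedule(tf.constant(step))))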
When trying to align the optimizer, I tried three different approaches:
#1: update the iterations attribute of the optimizer, which is used as the step passed to the scheduler (I checked at execution time that this is the case):
initial_step = 0
for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        initial_step += 1
# once, after counting all the skipped steps
self.optimizer.iterations.assign_add(initial_step)
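If the data fast-forwarding is handled elsewhere and the number of batches per epoch is constant, the same count can be computed without iterating the dataset (a sketch; steps_per_epoch is a hypothetical constant, not something defined above):

# Equivalent count, assuming a constant number of batches per epoch.
# `steps_per_epoch` is a hypothetical constant standing in for that number.
initial_step = starting_epoch * steps_per_epoch
self.optimizer.iterations.assign_add(initial_step)  # iterations starts at 0 on a fresh optimizer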
#2: include an initial_step attribute in the scheduler, then update it (this also works at execution time):
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model: int, warmup_steps: int, initial_step: int = 0):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        self.initial_step = initial_step

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step + self.initial_step)
        arg2 = (step + self.initial_step) * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        optimizer.learning_rate.initial_step += 1  # <- optimizer.lr is also updated, guess they might be aliases
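A variation on this approach (just a sketch, not something the results below depend on): storing initial_step in a tf.Variable, so that any later assignment is guaranteed to be visible even if __call__ ends up traced inside a tf.function; steps_per_epoch here is a hypothetical constant:

# Sketch: keep the offset in a tf.Variable so updates are visible inside tf.function.
# `steps_per_epoch` is a hypothetical constant (batches per epoch).
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model: int, warmup_steps: int, initial_step: int = 0):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        self.initial_step = tf.Variable(float(initial_step), trainable=False, dtype=tf.float32)

    def __call__(self, step):
        step = tf.cast(step, tf.float32) + self.initial_step
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

# On resume, set the offset once instead of incrementing it batch by batch:
optimizer.learning_rate.initial_step.assign(float(starting_epoch * steps_per_epoch))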
#3: re-instantiate the optimizer altogether:
initial_step = 0
for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        initial_step += 1
optimizer = tf.keras.optimizers.Adam(
    CustomSchedule(d_model, warmup_steps, initial_step=initial_step),
    beta_1=beta_1, beta_2=beta_2, epsilon=epsilon,
)
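A related detail about this third approach: a freshly instantiated optimizer starts with zeroed Adam slot variables (the running first/second moment estimates), unless the optimizer state is saved and restored as well. A sketch of doing that with tf.train.Checkpoint, assuming that mechanism is compatible with the existing checkpointing (the directory name is arbitrary):

# Sketch: checkpoint the optimizer together with the model so that Adam's slot
# variables (m, v) and the iteration counter survive an interruption.
# Assumes tf.train.Checkpoint fits the existing setup.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="./checkpoints", max_to_keep=3)

manager.save()                                            # during training, e.g. once per epoch
ckpt.restore(manager.latest_checkpoint).expect_partial()  # when resuming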
All three approaches result in the desired behavior when execution enters the scheduler's __call__ method: step is updated and reflects that it is not the very first one. Nonetheless, the results are still different.
Could I be missing something else?
=====
UPDATE:
After some excruciating debugging, it turns out any of the approaches works, and aligning the step is indeed necessary. However, my main issue was with the Dropout layers: when loading a checkpoint they are never called for the skipped epochs, so their internal random state is misaligned with respect to the seed. I could force an alignment by passing data through them without updating the weights, which is not ideal (the forward computations for the epochs that should be skipped are performed and then discarded, although I do save time by skipping all the backpropagation-related steps, which do not depend on the seed), but I don't think there is an alternative.
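Concretely, the forced alignment amounts to something like this (a simplified sketch; the (inputs, targets) structure of the dataset elements is an assumption):

# Simplified sketch of re-aligning the Dropout RNG state when resuming: run the
# forward pass in training mode for the skipped epochs, but never compute or
# apply gradients, so only the stateful random ops (dropout masks) advance.
for epoch in range(starting_epoch):
    for inputs, targets in dataset:
        _ = model(inputs, training=True)  # output discarded; no gradient tape, no weight update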
Only... reproducibility is still compromised. I followed the steps carefully and noticed that all data is identical, whether training from the beginning or resuming from a checkpoint, right up to the call to apply_gradients. All the weights are the same, and so are the gradients themselves, but for some reason I could not identify, the weights diverge after the gradients are applied. The inner parameters of the optimizer (betas, decays, learning rates...) are also identical. For the time being, I have no idea why this happens.