My model is implemented in TensorFlow, and since it might take a long time to train, I am trying to implement a way to resume training after an interruption. However, I haven't been able to get the same results when training is resumed as when it runs in a single, uninterrupted pass. Both the seed and tf.config.experimental.enable_op_determinism are set and working as intended: I get identical results from two complete runs. But the results diverge at the very first epoch after resuming an interrupted run.
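For context, the determinism setup looks roughly like this (a minimal sketch, assuming the seed is set through tf.keras.utils.set_random_seed; the value 42 is arbitrary):

import tensorflow as tf

# Seed the Python, NumPy and TensorFlow RNGs in one call, then force
# deterministic op implementations (this may slow training down).
tf.keras.utils.set_random_seed(42)
tf.config.experimental.enable_op_determinism()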
Saving/reloading checkpoint weights works fine, so that shouldn't be the issue. My data is wrapped in a tf.data.Dataset object with shuffling enabled, and when training is resumed I make sure the epochs and batches consumed before the interruption are produced and then discarded; I double (triple!) checked that the data actually matches the desired behavior.
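Roughly, that fast-forwarding of the data looks like this (a simplified sketch; make_dataset, seed, batch_size and starting_epoch are placeholders standing in for my actual pipeline):

# Simplified sketch of discarding the already-trained epochs/batches on resume,
# so the shuffle RNG advances exactly as it would in a single uninterrupted run.
# `make_dataset`, `seed`, `batch_size` and `starting_epoch` are placeholders.
dataset = (make_dataset()
           .shuffle(buffer_size=10_000, seed=seed, reshuffle_each_iteration=True)
           .batch(batch_size))

for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        pass  # consume and discard, without training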
My remaining suspicion concerns the optimizer, since it uses a schedule for the learning rate. I defined a custom schedule:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model: int, warmup_steps: int):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):  # <- depends on step, which needs aligning
        step = tf.cast(step, tf.float32)  # step may arrive as an integer tensor
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
and then I pass it to an instance of Adam:
optimizer = tf.keras.optimizers.Adam(
    CustomSchedule(d_model, warmup_steps),
    beta_1=beta_1, beta_2=beta_2, epsilon=epsilon,
)
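As a quick sanity check (outside the training loop), the schedule can be evaluated directly at a few global steps; the values should match between a fresh run and a resumed run at the same step:

# Evaluate the schedule at a few global steps; these values should be identical
# in a single uninterrupted run and in a resumed run at the same step.
schedule = CustomSchedule(d_model, warmup_steps)
for step in (1.0, 100.0, float(warmup_steps), 10_000.0):
    print(int(step), float(schedule(tf.constant(step))))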
When trying to align the optimizer, I tried three different approaches:
#1: update the iterations attribute of the optimizer, which is used as the step passed to the scheduler (I checked at execution time that this is the case):
initial_step = 0
for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        initial_step += 1
# once, after counting all the skipped steps
self.optimizer.iterations.assign_add(initial_step)
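If the data fast-forwarding is handled elsewhere and the number of batches per epoch is constant, the same count can be computed without iterating the dataset (a sketch; steps_per_epoch is a hypothetical constant, not something defined above):

# Equivalent count, assuming a constant number of batches per epoch.
# `steps_per_epoch` is a hypothetical constant standing in for that number.
initial_step = starting_epoch * steps_per_epoch
self.optimizer.iterations.assign_add(initial_step)  # iterations starts at 0 on a fresh optimizer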
#2: include an initial_step attribute in the scheduler, then update it (this also works at execution time):
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model: int, warmup_steps: int, initial_step: int = 0):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        self.initial_step = initial_step

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step + self.initial_step)
        arg2 = (step + self.initial_step) * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        optimizer.learning_rate.initial_step += 1  # <- optimizer.lr is also updated, guess they might be aliases
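A variation on this approach (just a sketch, not something the results below depend on): storing initial_step in a tf.Variable, so that any later assignment is guaranteed to be visible even if __call__ ends up traced inside a tf.function; steps_per_epoch here is a hypothetical constant:

# Sketch: keep the offset in a tf.Variable so updates are visible inside tf.function.
# `steps_per_epoch` is a hypothetical constant (batches per epoch).
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model: int, warmup_steps: int, initial_step: int = 0):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        self.initial_step = tf.Variable(float(initial_step), trainable=False, dtype=tf.float32)

    def __call__(self, step):
        step = tf.cast(step, tf.float32) + self.initial_step
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

# On resume, set the offset once instead of incrementing it batch by batch:
optimizer.learning_rate.initial_step.assign(float(starting_epoch * steps_per_epoch))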
#3: re-instantiate the optimizer altogether:
initial_step = 0
for epoch in range(starting_epoch):
    for batch, data in enumerate(dataset):
        initial_step += 1
optimizer = tf.keras.optimizers.Adam(
    CustomSchedule(d_model, warmup_steps, initial_step=initial_step),
    beta_1=beta_1, beta_2=beta_2, epsilon=epsilon,
)
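A related detail about this third approach: a freshly instantiated optimizer starts with zeroed Adam slot variables (the running first/second moment estimates), unless the optimizer state is saved and restored as well. A sketch of doing that with tf.train.Checkpoint, assuming that mechanism is compatible with the existing checkpointing (the directory name is arbitrary):

# Sketch: checkpoint the optimizer together with the model so that Adam's slot
# variables (m, v) and the iteration counter survive an interruption.
# Assumes tf.train.Checkpoint fits the existing setup.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="./checkpoints", max_to_keep=3)

manager.save()                                            # during training, e.g. once per epoch
ckpt.restore(manager.latest_checkpoint).expect_partial()  # when resuming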
All three approaches result in the desired behavior when execution enters the scheduler's __call__ method: step is updated and reflects that it is not the very first one. Nonetheless, the results are still different.
Could I be missing something else?
=====
UPDATE:
After some excruciating debugging, it turns out any of the approaches works, and aligning the step is indeed necessary. However, my main issue was with the Dropout layers: when loading a checkpoint they are never called for the skipped epochs, so their internal random state is misaligned with respect to the seed. I could force an alignment by passing data through them without updating the weights, which is not ideal (the forward computations for the epochs that should be skipped are performed and then discarded, although I do save time by skipping all the backpropagation-related steps, which do not depend on the seed), but I don't think there is an alternative.
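Concretely, the forced alignment amounts to something like this (a simplified sketch; the (inputs, targets) structure of the dataset elements is an assumption):

# Simplified sketch of re-aligning the Dropout RNG state when resuming: run the
# forward pass in training mode for the skipped epochs, but never compute or
# apply gradients, so only the stateful random ops (dropout masks) advance.
for epoch in range(starting_epoch):
    for inputs, targets in dataset:
        _ = model(inputs, training=True)  # output discarded; no gradient tape, no weight update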
Only... reproducibility is still compromised. I followed the steps carefully and noticed that all data is identical, whether training from the beginning or resuming from a checkpoint, right up to the call to apply_gradients. All the weights are the same, and so are the gradients themselves, but for some reason I could not identify, the weights diverge after the gradients are applied. The inner parameters of the optimizer (betas, decays, learning rates...) are also identical. For the time being, I have no idea why this happens.