When I train a model multiple times in a row, the training iterations slow down, even though every relevant object is created inside a for
loop (and should therefore be overwritten on each pass, which I would expect to be enough to avoid growing computation graphs or steadily increasing memory use).
This is the code I use:
import numpy as np
import tensorflow as tf
import time
n_samples = 300000
n_features = 100
n_targets = 5
batch_size = 100
x = np.array(range(n_samples * n_features), dtype=np.float64).reshape((n_samples, n_features))
y = np.array(range(n_samples * n_targets), dtype=np.float64).reshape((n_samples, n_targets))
for t_idx in range(10):
    # Everything below is re-created from scratch on every outer iteration
    dataset = [x, y]
    dataset = tf.data.Dataset.from_tensor_slices(tuple(dataset)).shuffle(n_samples).repeat().batch(batch_size=batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    data_iterator = iter(dataset)

    # Fresh model, optimizer and tf.functions for this run
    inputs = tf.keras.Input(shape=(n_features,), name='input')
    outputs = tf.keras.layers.Dense(n_features, name='dense_1', activation=tf.keras.activations.relu)(inputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_2', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_3', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_4', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_targets, name='output', activation=tf.keras.activations.linear)(outputs)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    trainable_variables = list(model.trainable_variables)
    adam_opt = tf.optimizers.Adam(learning_rate=0.001)

    @tf.function
    def loss(batch):
        x_, y_ = batch
        y_pred_ = model(x_)
        return tf.keras.losses.MSE(y_pred_, y_)

    @tf.function
    def optimization_step():
        batch = next(data_iterator)
        def f(): return loss(batch)
        adam_opt.minimize(f, var_list=trainable_variables)

    # Time one full training run
    iterations = 50000
    loop_start = time.time()
    optimization_times = []
    for idx in range(iterations):
        optimization_step()
    loop_end = time.time()
    print(f'Elapsed: {loop_end - loop_start}')
I know I could use model.fit()
to train the model, but that is not an option in my case (the code above is a stripped-down version of a much more complicated codebase, where the only option I have is to call the optimizer directly).
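For context, the kind of direct optimizer call I make in the real codebase is roughly equivalent to optimization_step above, just written with an explicit tf.GradientTape instead of Optimizer.minimize; the sketch below is only an illustration of that pattern (it reuses the names from the code above and is not the exact code I run):

@tf.function
def optimization_step_tape():
    # Same work as optimization_step above, but with an explicit gradient tape
    x_, y_ = next(data_iterator)
    with tf.GradientTape() as tape:
        y_pred_ = model(x_, training=True)
        loss_value = tf.reduce_mean(tf.keras.losses.MSE(y_, y_pred_))
    grads = tape.gradient(loss_value, trainable_variables)
    adam_opt.apply_gradients(zip(grads, trainable_variables))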
Output:
Elapsed: 49.798316955566406
Elapsed: 55.18571472167969
Elapsed: 58.57510209083557
Elapsed: 64.41855955123901
Elapsed: 66.76858448982239
Elapsed: 68.3305652141571
Elapsed: 67.73438382148743
Elapsed: 69.73751258850098
Elapsed: 73.59102845191956
Elapsed: 73.14124798774719
I expected the elapsed time for each training run to be roughly the same, but instead there is a clear upward trend. What am I doing wrong?
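For completeness, here is the kind of explicit reset I could add at the top of each outer iteration, in case the slowdown comes from global Keras / tf.function state accumulating across runs; tf.keras.backend.clear_session() and gc.collect() are standard calls, but they are not in my original code and I have not verified that they change anything here:

import gc

for t_idx in range(10):
    # Drop global graph state left over from the previous run before rebuilding everything
    tf.keras.backend.clear_session()
    gc.collect()
    # ... rest of the loop body as above (dataset, model, optimizer, training loop) ...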
System information
- OS: Windows 10
- python version: 3.7.9
- tensorflow version: 2.3.1
EDIT
Following the replies by @Nicolas Gervais and @M.Innat, I've been experimenting with the two versions of the code (mine and @M.Innat's) on different machines. This is what I discovered:
- There is quite a difference in the results on the two separate Windows machines I tested on (same Windows version, same package versions, same Python version, in a clean virtualenv); the better of the two still shows an upward trend in the training times, although smaller than on the machine I originally tested on; results on the better of the two are as follows:
WITH MY CODE:
Elapsed: 49.35429096221924
Elapsed: 52.551310777664185
Elapsed: 54.324320554733276
Elapsed: 56.53233051300049
Elapsed: 56.81632399559021
Elapsed: 58.70533752441406
Elapsed: 59.68834161758423
Elapsed: 61.419353008270264
Elapsed: 60.33834195137024
Elapsed: 62.536344051361084
WITH @M.Innat's CODE:
Elapsed: 50.51127886772156
Elapsed: 53.94429612159729
Elapsed: 52.38828897476196
Elapsed: 54.5512957572937
Elapsed: 58.1543083190918
Elapsed: 61.21232509613037
Elapsed: 60.11531925201416
Elapsed: 59.95942974090576
Elapsed: 60.48531889915466
Elapsed: 59.37330341339111
- As you can see, the times look slightly more stable on this second Windows machine than what I originally posted (but note that on the machine I used for my initial tests, the two versions give essentially the same results); nonetheless, the upward trend is still clearly visible, which is odd, since it did not appear in the results posted by @M.Innat;
- This issue does not come up either on Linux or on Mac OS.
I'm starting to feel like this is somehow a bug in TensorFlow (even though it's strange that two machines with the exact same configuration yield such different results), but I wouldn't really know what to attribute this behaviour to.
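If it helps to narrow this down, a small refinement of my timing code would be to print the elapsed time per chunk of iterations inside a single run, to see whether a run also slows down internally or whether the slowdown only shows up across runs (this is plain time.time() bookkeeping around the inner loop above, nothing TensorFlow-specific):

chunk_size = 10000
chunk_start = time.time()
for idx in range(iterations):
    optimization_step()
    if (idx + 1) % chunk_size == 0:
        now = time.time()
        print(f'Run {t_idx}, iterations {idx + 1 - chunk_size}-{idx + 1}: {now - chunk_start:.2f}s')
        chunk_start = now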