When I train a model multiple times, the training iterations slow down, even though all the relevant objects are recreated inside a for loop (and should therefore be overwritten each time, which I expected would be enough to avoid growing computation graphs or accumulating memory).

This is the code I use:

import numpy as np
import tensorflow as tf
import time

n_samples = 300000
n_features = 100
n_targets = 5
batch_size = 100
x = np.array(range(n_samples * n_features), dtype=np.float64).reshape((n_samples, n_features))
y = np.array(range(n_samples * n_targets), dtype=np.float64).reshape((n_samples, n_targets))
for t_idx in range(10):
    dataset = [x, y]
    dataset = tf.data.Dataset.from_tensor_slices(tuple(dataset)).shuffle(n_samples).repeat().batch(batch_size=batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    data_iterator = iter(dataset)

    inputs = tf.keras.Input(shape=(n_features,), name='input')
    outputs = tf.keras.layers.Dense(n_features, name='dense_1', activation=tf.keras.activations.relu)(inputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_2', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_3', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_4', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_targets, name='output', activation=tf.keras.activations.linear)(outputs)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    trainable_variables = list(model.trainable_variables)

    adam_opt = tf.optimizers.Adam(learning_rate=0.001)


    @tf.function
    def loss(batch):
        x_, y_ = batch
        y_pred_ = model(x_)
        return tf.keras.losses.MSE(y_pred_, y_)


    @tf.function
    def optimization_step():
        batch = next(data_iterator)
        def f(): return loss(batch)
        adam_opt.minimize(f, var_list=trainable_variables)

    iterations = 50000
    loop_start = time.time()
    for idx in range(iterations):
        optimization_step()

    loop_end = time.time()
    print(f'Elapsed: {loop_end - loop_start}')

I know I could use model.fit() to train the model, but that is not an option in my case (the code above was obtained by stripping down a much more complicated codebase, where the only option I have is to call the optimizer directly).
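
For what it's worth, the same update can also be written with an explicit gradient tape instead of adam_opt.minimize(); this is just a sketch of the equivalent formulation under the setup above (same model, data_iterator, trainable_variables and adam_opt), in case it helps to see what the optimizer call expands to:

@tf.function
def optimization_step():
    x_, y_ = next(data_iterator)
    with tf.GradientTape() as tape:
        y_pred_ = model(x_, training=True)
        # reduce the per-sample MSE to a scalar before differentiating
        loss_value = tf.reduce_mean(tf.keras.losses.MSE(y_, y_pred_))
    gradients = tape.gradient(loss_value, trainable_variables)
    adam_opt.apply_gradients(zip(gradients, trainable_variables))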

Output:

Elapsed: 49.798316955566406
Elapsed: 55.18571472167969
Elapsed: 58.57510209083557
Elapsed: 64.41855955123901
Elapsed: 66.76858448982239
Elapsed: 68.3305652141571
Elapsed: 67.73438382148743
Elapsed: 69.73751258850098
Elapsed: 73.59102845191956
Elapsed: 73.14124798774719

I expected the elapsed time for each training to be roughly the same, while instead there is a clear upward trend. What am I doing wrong?

System information

  • OS: Windows 10
  • python version: 3.7.9
  • tensorflow version: 2.3.1

EDIT

Following the replies by @Nicolas Gervais and @M.Innat, I've been experimenting with the two versions of the code (mine and @M.Innat's) on different machines. This is what I discovered:

  • There is quite a difference in the results on two separate Windows machines I tested on (same Windows version, same package versions, same Python version, in a clean virtualenv); the better of the two still shows an upward trend in the training times, though a smaller one than on the machine I originally tested on. Results on the better of the two are as follows:
    WITH MY CODE:
    Elapsed: 49.35429096221924
    Elapsed: 52.551310777664185
    Elapsed: 54.324320554733276
    Elapsed: 56.53233051300049
    Elapsed: 56.81632399559021
    Elapsed: 58.70533752441406
    Elapsed: 59.68834161758423
    Elapsed: 61.419353008270264
    Elapsed: 60.33834195137024
    Elapsed: 62.536344051361084

    WITH @M.Innat CODE:
    Elapsed: 50.51127886772156
    Elapsed: 53.94429612159729
    Elapsed: 52.38828897476196
    Elapsed: 54.5512957572937
    Elapsed: 58.1543083190918
    Elapsed: 61.21232509613037
    Elapsed: 60.11531925201416
    Elapsed: 59.95942974090576
    Elapsed: 60.48531889915466
    Elapsed: 59.37330341339111
  • As you can see, times look slightly more stable on this second Windows machine than in what I originally posted (but note that they are basically the same when I test on the machine I used for my initial tests); nonetheless, the trend is still clearly visible. This is weird, since it did not happen in the results posted by @M.Innat;
  • This issue does not come up either on Linux or on Mac OS.

I'm starting to feel like this is a bug in TensorFlow somehow (even though it's weird that different machines with the exact same configuration yield such different results), but I wouldn't really know what to attribute this behaviour to.
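
One way to narrow this down further (a diagnostic sketch, not a fix) would be to time fixed-size windows of iterations inside a single training run; if the per-window time grows within one run, the slowdown accrues step by step rather than only across model rebuilds:

window = 5000  # window size chosen arbitrarily for illustration
window_start = time.time()
for idx in range(iterations):
    optimization_step()
    if (idx + 1) % window == 0:
        print(f'steps {idx + 1 - window}-{idx + 1}: {time.time() - window_start:.2f}s')
        window_start = time.time()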

2 Answers


You might want to try tf.keras.backend.clear_session(), as covered in the documentation, which explains what happens to memory consumption when models are created in a loop:

If you are creating many models in a loop, this global state will consume an increasing amount of memory over time, and you may want to clear it. Calling clear_session() releases the global state: this helps avoid clutter from old models and layers, especially when memory is limited.

Without clear_session(), each iteration of this loop will slightly increase the size of the global state managed by Keras:

for _ in range(100):
  model = tf.keras.Sequential([tf.keras.layers.Dense(10) for _ in range(10)])

With clear_session() called at the beginning, Keras starts with a blank state at each iteration and memory consumption is constant over time.

for _ in range(100):
  tf.keras.backend.clear_session()
  model = tf.keras.Sequential([tf.keras.layers.Dense(10) for _ in range(10)])
– Nicolas Gervais
  • Thanks for the quick reply! Indeed, I missed that in the documentation and it did seem promising, but unfortunately the results are still the same even after I add a call to `clear_session()` at the beginning of each iteration (training times get ever so slightly lower, but the upward trend remains). – Mattia Tamellini Mar 19 '21 at 17:24

I don't have an exact explanation for the main cause of this; it looks like some kind of memory leak, probably. I ran your code with a bit of modification. Try this:

import numpy as np
import tensorflow as tf
import time, gc

tf.config.run_functions_eagerly(False)

n_samples = 300000
n_features = 100
n_targets = 5
batch_size = 100
x = np.array(range(n_samples * n_features), dtype=np.float64).reshape((n_samples, n_features))
y = np.array(range(n_samples * n_targets), dtype=np.float64).reshape((n_samples, n_targets))

def get_model():
    inputs = tf.keras.Input(shape=(n_features,), name='input')
    outputs = tf.keras.layers.Dense(n_features, name='dense_1', activation=tf.keras.activations.relu)(inputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_2', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_3', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_features, name='dense_4', activation=tf.keras.activations.relu)(outputs)
    outputs = tf.keras.layers.Dense(n_targets, name='output', activation=tf.keras.activations.linear)(outputs)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

    
for t_idx in range(7):
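    # reset Keras' global state and run a garbage-collection pass,
    # so each outer run starts from a clean slate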
    tf.keras.backend.clear_session()
    gc.collect()
    
    dataset = [x, y]
    dataset = tf.data.Dataset.from_tensor_slices(tuple(dataset)).shuffle(n_samples).repeat().batch(batch_size=batch_size).prefetch(tf.data.experimental.AUTOTUNE)
    data_iterator = iter(dataset)
    
    model = get_model()
    trainable_variables = list(model.trainable_variables)
    adam_opt = tf.optimizers.Adam(learning_rate=0.001)
    
    @tf.function
    def loss(batch):
        x_, y_ = batch
        y_pred_ = model(x_)
        return tf.keras.losses.MSE(y_pred_, y_)

    @tf.function
    def optimization_step():
        batch = next(data_iterator)
        def f(): return loss(batch)
        adam_opt.minimize(f, var_list=trainable_variables)

    iterations = 50000
    loop_start = time.time()
    
    for idx in range(iterations):
        optimization_step()

    print(f'Elapsed: {time.time() - loop_start}')
    
    del model, dataset

With tf.config.run_functions_eagerly(False), it gave the following:

Elapsed: 123.39986753463745
Elapsed: 121.13122296333313
Elapsed: 118.44610977172852
Elapsed: 116.83040761947632
Elapsed: 118.46350479125977
Elapsed: 118.59502696990967
Elapsed: 120.34098505973816

And without it, it gave the following:

Elapsed: 114.46430349349976
Elapsed: 124.23725700378418
Elapsed: 126.14825916290283
Elapsed: 127.5985496044159
Elapsed: 126.79593586921692
Elapsed: 124.1206603050232
Elapsed: 125.85739850997925

Though I don't think that is the main issue; there may be something more that needs a deeper look.
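
To probe the memory-leak hypothesis, one could also log the process's resident memory after each outer run; a minimal sketch, assuming the third-party psutil package is installed (it is not used in the code above):

import psutil

process = psutil.Process()  # handle to the current process
# ... at the end of each outer t_idx iteration:
print(f'RSS after run {t_idx}: {process.memory_info().rss / 2**20:.1f} MiB')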

[My OS: Windows 10, TF: 2.4.1, GPU: RTX 2070.]

– Innat