
I'm working on Conv-TasNet; the model I made has about 5.05 million variables.

I want to train it using a custom training loop, and the problem is this:

for i, (input_batch, target_batch) in enumerate(train_ds): # each shape is (64, 32000, 1)
    with tf.GradientTape() as tape:
        predicted_batch = cv_tasnet(input_batch, training=True) # model name
        loss = calculate_sisnr(predicted_batch, target_batch) # some custom loss
    trainable_vars = cv_tasnet.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    cv_tasnet.optimizer.apply_gradients(zip(gradients, trainable_vars))

This part exhausts all of the GPU memory (24 GB available).
When I tried it without `tf.GradientTape() as tape`:

for i, (input_batch, target_batch) in enumerate(train_ds):
        predicted_batch = cv_tasnet(input_batch, training=True)
        loss = calculate_sisnr(predicted_batch, target_batch)

This uses a reasonable amount of GPU memory (about 5–6 GB).

I tried the same `tf.GradientTape() as tape` pattern on the basic MNIST data, and it works without a problem.
So does the model size matter? But the same error arises even when I lower `BATCH_SIZE` to 32 or smaller.

Why does the 1st code block exhaust all the GPU memory?

Of course, I put

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

this code in the very first cell.

1 Answer

Gradient tape triggers automatic differentiation, which requires keeping track of all your weights and intermediate activations so the backward pass can be computed. Autodiff needs several times more memory than the forward pass alone. This is normal. You'll have to manually tune your batch size until you find one that works, then tune your learning rate. Usually, tuning just means guess & check or grid search, as sketched below. (I am working on a product to do all of that for you, but I'm not here to plug it.)
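A minimal sketch of that guess-and-check, reusing `cv_tasnet`, `calculate_sisnr` and `train_ds` from the question; the helper names here are just illustrative, and in practice an out-of-memory error can leave the GPU in a messy state, so you may prefer to restart the process between probes rather than loop within one session:

import tensorflow as tf

def run_one_step(model, loss_fn, input_batch, target_batch):
    # one forward/backward pass, same structure as the loop in the question
    with tf.GradientTape() as tape:
        predicted_batch = model(input_batch, training=True)
        loss = loss_fn(predicted_batch, target_batch)
    gradients = tape.gradient(loss, model.trainable_variables)
    model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))

def largest_fitting_batch_size(model, loss_fn, dataset, start=64):
    # halve the candidate batch size until one full train step fits in GPU memory
    candidate = start
    while candidate >= 1:
        probe = dataset.unbatch().batch(candidate).take(1)
        try:
            for input_batch, target_batch in probe:
                run_one_step(model, loss_fn, input_batch, target_batch)
            return candidate
        except tf.errors.ResourceExhaustedError:
            candidate //= 2
    raise RuntimeError("even batch size 1 does not fit in GPU memory")

# BATCH_SIZE = largest_fitting_batch_size(cv_tasnet, calculate_sisnr, train_ds)

Once a batch size fits, a common heuristic is to scale the learning rate roughly in proportion to the batch size, relative to whatever baseline you started from.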

  • This is both encouraging and discouraging news... encouraging in the sense that I didn't do anything wrong, but discouraging in the sense that I'll have to work with smaller sizes than I wanted. – HyeonPhil Youn Jan 07 '22 at 06:36
    @HyeonPhilYoun for tuning batch size based on model size for GPU v-ram, you may find the following link helpful. [gist](https://colab.research.google.com/drive/1FePCsRdutXNyTCiMGuXdXaB8QDj453R0?usp=sharing). – Innat Jan 07 '22 at 11:44
  • And for tuning lr based on batch size, you can check the following as options - (1). [version-13-cell-14](https://www.kaggle.com/ipythonx/tf-keras-ranzcr-multi-attention-efficientnet), (2). [version-17-cell-13](https://www.kaggle.com/ipythonx/tf-keras-learning-to-resize-image-for-vit-model). – Innat Jan 07 '22 at 11:49
  • I set `BATCH_SIZE` to 8 and then it worked. Even this `BATCH_SIZE` occupies 70% of the 24 GB of VRAM. And when I decorated the `train_step` with `@tf.function`, the training loop doesn't work; I mean it seems idle forever (it doesn't use the CPU or GPU at all). Do you have any idea about it? Without the decorator the training loop works. (See the `@tf.function` sketch after these comments.) – HyeonPhil Youn Jan 07 '22 at 11:52
  • @M.Innat That colab looks really helpful, I will check it. Thank you. – HyeonPhil Youn Jan 07 '22 at 11:54
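Regarding the `@tf.function` comment above, here is a minimal sketch of the usual way to decorate a custom train step, assuming the same `cv_tasnet`, `calculate_sisnr` and `train_ds` as in the question. It is not a confirmed fix for the reported hang; note that the first call to a `tf.function` traces and compiles the graph, which can take a long time for a large model, so a loop that looks idle at first may simply still be tracing.

import tensorflow as tf

@tf.function
def train_step(input_batch, target_batch):
    # passing the batches in as arguments (rather than closing over Python
    # objects that change each iteration) avoids retracing on every step
    with tf.GradientTape() as tape:
        predicted_batch = cv_tasnet(input_batch, training=True)
        loss = calculate_sisnr(predicted_batch, target_batch)
    gradients = tape.gradient(loss, cv_tasnet.trainable_variables)
    cv_tasnet.optimizer.apply_gradients(zip(gradients, cv_tasnet.trainable_variables))
    return loss

for i, (input_batch, target_batch) in enumerate(train_ds):
    loss = train_step(input_batch, target_batch)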