
My Python code has two steps. In each step, I train a neural network (primarily using from mesh_transformer.transformer_shard import CausalTransformer) and delete the network before the next step, in which I train another network with the same function. The problem is that in some cases, I receive this error:

Resource exhausted: Failed to allocate request for 32.00MiB (33554432B) on device ordinal 0: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

I think there is still some leftover state in TPU memory, beyond the network itself, that I need to remove. The point here is that the two steps are independent; they don't share any information or variables. But I have to run them sequentially to manage my storage on Google Cloud. Also, when I run the two steps separately, each works fine. Is there any way to clean the TPU memory thoroughly before going to the next step of my code? I think just deleting the network is not enough.
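
For reference, the structure is roughly this (a minimal sketch; the configs and the train helper are placeholders for my actual code):

import gc
from mesh_transformer.transformer_shard import CausalTransformer

# Step 1: build, train, and delete the first network.
network = CausalTransformer(config_step1)  # config_step1: placeholder for my first config
train(network)                             # train: placeholder for my training loop
del network
gc.collect()                               # I expected this to release the TPU buffers

# Step 2: a fresh, independent network built with the same function.
# This is where the allocation sometimes fails with "Resource exhausted".
network = CausalTransformer(config_step2)  # config_step2: placeholder for my second config
train(network)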

Eghbal

2 Answers


Unfortunately, you can't explicitly clean the TPU memory, but there are ways to reduce how much of it you use.

The most effective ways to reduce memory usage are to:

Reduce excessive tensor padding

Tensors in TPU memory are padded, that is, the TPU rounds up the sizes of tensors stored in memory to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.
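
As an illustration of the padding rule (my own sketch, assuming the documented layout where the last dimension is rounded up to a multiple of 128 and the second-to-last to a multiple of 8):

import math

def padded_elements(rows, cols):
    # Assumed TPU padding rule: second-to-last dimension rounded up to a
    # multiple of 8, last dimension rounded up to a multiple of 128.
    return (math.ceil(rows / 8) * 8) * (math.ceil(cols / 128) * 128)

print(padded_elements(100, 100))  # 104 * 128 = 13312 elements stored for 10000 real ones
print(padded_elements(104, 128))  # 13312 -- same footprint with no wasted padding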

Reduce the batch size

Gradually reduce the batch size until it fits in memory, making sure that the total batch size is a multiple of 64 (the per-core batch size must be a multiple of 8). Keep in mind that larger batch sizes are more efficient on the TPU. A total batch size of 1024 (128 per core) is generally a good starting point.

If the model cannot be run on the TPU even with a small batch size (for example, 64), try reducing the number of layers or the layer sizes.
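
For example (my own sketch, assuming an 8-core device such as a v2-8 or v3-8):

num_cores = 8                             # assumed 8-core TPU, e.g. v2-8 or v3-8
per_core_batch = 128                      # must be a multiple of 8
total_batch = per_core_batch * num_cores  # 1024, the suggested starting point

# If 1024 runs out of memory, step the per-core batch down; with 8 cores
# the total automatically stays a multiple of 64.
for per_core in (128, 64, 32, 16, 8):
    print(per_core, per_core * num_cores)  # 1024, 512, 256, 128, 64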

You can read more about troubleshooting in this documentation.

Eduardo Ortiz

You can try to clean the TPU state after each training step with a tf.tpu.experimental.shutdown_tpu_system() call and see if that helps.
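
A minimal sketch of how that could look, assuming a TF 2.x Cloud TPU setup (the TPU name and the two run_step functions are placeholders):

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')  # placeholder TPU name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

run_step_one()  # placeholder: first training step

# Shut the TPU system down to drop its state, then reinitialize it
# before the second, independent step.
tf.tpu.experimental.shutdown_tpu_system(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

run_step_two()  # placeholder: second training step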

Another option is to restart the TPU runtime to clear its memory. First install the client:

pip3 install cloud-tpu-client

Then:

import tensorflow as tf
from cloud_tpu_client import Client

print(tf.__version__)

# Restart the TPU runtime unconditionally, keeping it on the same
# TensorFlow version as the local installation.
Client().configure_tpu_version(tf.__version__, restart_type='always')
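
The restart takes a little while; if I remember the cloud-tpu-client API correctly, you can block until the TPU is reachable again with Client().wait_for_healthy() before kicking off the next step.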
Gagik