My Python code has two steps. In each step, I train a neural network (primarily using `CausalTransformer` from `mesh_transformer.transformer_shard`) and delete the network before the next step, in which I train another network with the same function. The problem is that in some cases, I receive this error:
Resource exhausted: Failed to allocate request for 32.00MiB (33554432B) on device ordinal 0: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
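
For context, the overall structure looks roughly like this (`train_one_step`, `step1_config`, `step2_config`, and the training-loop body are placeholders, not my actual code):

```python
from mesh_transformer.transformer_shard import CausalTransformer

def train_one_step(config):
    # Build a fresh network for this step (config is a placeholder dict).
    network = CausalTransformer(config)
    # ... train this network on this step's data ...
    # Drop the Python reference before the next step runs.
    del network

train_one_step(step1_config)  # step 1: works
train_one_step(step2_config)  # step 2: sometimes fails with the error above
```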
I think there is still something left in TPU memory, besides the network itself, that I need to remove. The point here is that the two steps are independent; they don't share any information or variables. But I have to run them sequentially to manage my storage on Google Cloud. Also, when I run the two steps separately, everything works fine. Is there any way to clean TPU memory thoroughly before going to the next step of my code? I think just deleting the network is not enough.
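
For reference, the kind of cleanup I have in mind is sketched below. Deleting live device buffers through `jax.lib.xla_bridge` is just my guess based on what I could find, not a cleanup API that mesh-transformer-jax documents, and I don't know whether it is correct or complete:

```python
import gc
import jax

del network   # drop the Python reference to the CausalTransformer
gc.collect()  # force Python to release the object graph

# Explicitly delete whatever device buffers the JAX backend still holds
# (assuming the backend client exposes live_buffers(); this is my own
# attempt, not a documented cleanup path).
backend = jax.lib.xla_bridge.get_backend()
for buf in backend.live_buffers():
    buf.delete()
```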