
I recently switched to TensorFlow Eager (currently working with TF 1.8.0) and like it a lot. However, I now have quite a large model that does not fit into my GPU memory (GTX 1080 Ti, 11 GB VRAM) when run with a gradient tape, which is needed to calculate the gradients in TF Eager. The forward pass (i.e. without a gradient tape) works fine.

I thought about using the gradient checkpointing code from OpenAI, hoping that it would help. However, simply using it as described in their GitHub repository does not seem to help under eager execution, i.e.

import tensorflow as tf
import tensorflow.contrib.eager as tfe
import memory_saving_gradients

# Monkey-patch tf.gradients as described in the OpenAI repository;
# using gradients_memory or gradients_speed does not change anything.
tf.__dict__["gradients"] = memory_saving_gradients.gradients_memory
# tf.__dict__["gradients"] = memory_saving_gradients.gradients_speed

[...]

with tfe.GradientTape() as g:
    output = run_large_model()
    loss = calculate_loss_on_output(output)
grads = g.gradient(loss, model.variables)
optimizer.apply_gradients(zip(grads, model.variables))

runs out of memory, regardless of whether gradient checkpointing is used.

My guess is that the gradient tape still stores all variables and the information required for the backward pass, and that gradient checkpointing has no effect because TF in eager mode does not actually construct a graph (from what I understand, or at least not the same kind of graph).

Do you have any experience with this, or an idea of how it could be solved, i.e. what I need to do to use gradient checkpointing in TF eager mode as well?

Nemorior

1 Answer


The gradient checkpointing code from OpenAI is based on graph rewriting, so it does not support eager execution.

The tensorflow.contrib.layers library has a `recompute_grad` decorator which is equivalent but is supported in both graph and eager execution.
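For illustration, here is a rough sketch of how the decorator might be used under TF 1.x eager execution. The block, the weights, and the sizes below are made up, and the exact handling of variables inside the recomputed function may need adjustment for a real model:

import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()

# Hypothetical weights for one memory-hungry block of the model.
w1 = tfe.Variable(tf.random_normal([1024, 4096]))
w2 = tfe.Variable(tf.random_normal([4096, 1024]))

@tf.contrib.layers.recompute_grad
def expensive_block(x):
    # Activations computed inside this function are not kept alive for the
    # backward pass; they are recomputed when gradients are requested.
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.nn.relu(tf.matmul(h, w2))

x = tf.random_normal([8, 1024])
with tfe.GradientTape() as g:
    loss = tf.reduce_sum(expensive_block(x))
grads = g.gradient(loss, [w1, w2])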

Alexandre Passos
  • Thanks for your reply. Is there a way to easily combine this recompute_grad decorator with Keras models? Suppose I have a Keras model `class BigNeuralNet(tf.keras.Model)` which initializes a number of layers in its `__init__()` method (e.g. `self.layer1 = tf.layers.Conv2D(...)`) and then uses these layers in the `call()` method. The call method of the model takes a tensor as input, but also some other values (e.g. a training/inference flag to control batch normalization) which are used during the forward pass. Is it enough to just use the `recompute_grad` decorator on the model's call method? – Nemorior Jun 11 '18 at 07:46
  • I think it should be. If it isn't, open a GitHub issue and CC me and we'll discuss there. – Alexandre Passos Jun 11 '18 at 16:45
  • @AlexandrePassos How did you get `recompute_grad` to work? I am using Keras and it doesn't work for me. I've contacted Joey Yearsley since he used it for DenseNet, but he hasn't gotten it to work with Keras either. – adam.hendry Jul 13 '19 at 19:24
  • Can you try `tf.recompute_grad` from the 2.0 nightly build? We have tests that it works with Keras. If it doesn't work for you, can you file a GitHub issue with a short reproducing example? – Alexandre Passos Jul 15 '19 at 19:31
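For later readers, here is a rough, untested sketch of the `tf.recompute_grad` approach suggested in the comments above, using TF 2.x. The `BigNeuralNet` layout, the layer sizes, and the choice to wrap a sub-block rather than the whole `call()` method are assumptions for illustration only:

import tensorflow as tf  # TF 2.x

class BigNeuralNet(tf.keras.Model):
    def __init__(self):
        super(BigNeuralNet, self).__init__()
        # Memory-hungry middle block of the network.
        self.block = tf.keras.Sequential([
            tf.keras.layers.Dense(4096, activation="relu"),
            tf.keras.layers.Dense(4096, activation="relu"),
        ])
        # Wrap the block so its activations are recomputed during the
        # backward pass instead of being kept by the gradient tape.
        self.block_ckpt = tf.recompute_grad(self.block)
        self.out = tf.keras.layers.Dense(10)

    def call(self, x, training=False):
        return self.out(self.block_ckpt(x))

model = BigNeuralNet()
x = tf.random.normal([8, 1024])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(model(x, training=True))
grads = tape.gradient(loss, model.trainable_variables)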