The solution to my problem
Unfortunately, my question is not answered by the question this is supposed to be a duplicate of. While it is true that the graph is changed during training, calling finalize() does not fix it, because Keras is the underlying problem. The correct answer is found here: for each model, I need to call _make_predict_function() after compiling, and the model I call model.fit() on needs to be "warmed up", as described in that answer, by calling predict() before finalizing and fitting.
The answer also explains why this happens; Keras tries to save memory by building the graph as late as possible.
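To make the order of calls concrete, here is a rough sketch using a toy model in place of my real encoder-decoder; the shapes and names are made up, and Keras 2.x with the TensorFlow 1.x backend is assumed. The explicit _make_train_function() call is my own precaution rather than part of the linked answer, so that fit() does not have to add the optimizer ops to an already finalized graph:

    import numpy as np
    import tensorflow as tf
    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    # Hypothetical toy model standing in for the real encoder-decoder.
    inp = Input(shape=(10, 8))
    out = Dense(8)(LSTM(16)(inp))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')

    # Build Keras' lazily created functions now instead of during fit()/predict().
    model._make_predict_function()
    model._make_train_function()  # precaution: also creates the optimizer ops

    # Warm-up predict: runs the session once, which also initializes all
    # variables (including the optimizer slots created above).
    model.predict(np.zeros((1, 10, 8)))

    # From here on, nothing should need to modify the graph any more.
    tf.get_default_graph().finalize()

    model.fit(np.zeros((4, 10, 8)), np.zeros((4, 8)), batch_size=4, epochs=2)

With this order, everything fit() needs already exists in the graph before it is frozen, so training no longer trips over the finalized graph.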
The original question
I am training an LSTM encoder-decoder model using Keras and tensorflow-gpu, running on an Nvidia Tesla K80 (two GPU cores) on a Gentoo machine. According to nvidia-smi, no other processes are using the GPU, and TensorFlow can access the GPUs just fine. My batch size is 4.
The training runs smoothly (without any warnings whatsoever) for a while. However, after some time, an out-of-memory exception is raised and training stops mid-epoch. I don't understand how a memory error can occur during training, in the middle of an epoch and without any prior warning, as I thought that once the tensors are allocated, no additional memory would be needed.
In the past, TensorFlow notified me when it allocated more than 10% of my GPU's memory, but no such warning appeared this time.
Here is some information from when the OOM occurred (right after TensorFlow had logged every chunk in use); it happened during epoch 34 of the 50 scheduled.
2018-08-11 06:15:07.676836: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 8.43GiB
2018-08-11 06:15:07.676848: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 11286285517
InUse: 9053307136
MaxInUse: 10209047296
NumAllocs: 975991019
MaxAllocSize: 1223803648
2018-08-11 06:15:07.698578: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ******************************************_________*********************_________*******************
2018-08-11 06:15:07.815583: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "GridSearch.py", line 139, in <module>
for (models, vals) in gs:
File "../lstms/GridSearch.py", line 25, in next_function_call
yield self.function_call(**val), val
File "../lstms/modeling.py", line 168, in define_and_train
train.fit([x1, x2], y, epochs=n_epoch, callbacks=[checkpointer])
File "/usr/lib64/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
validation_steps=validation_steps)
File "/usr/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2661, in __call__
return self._call(inputs)
File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2631, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/lib64/python3.6/site-packages/tensorflow/python/client/session.py", line 1454, in __call__
self._session._session, self._handle, args, status, None)
File "/usr/lib64/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: dec_dense_48/add = Add[T=DT_FLOAT, _class=["loc:@training_24/Adam/gradients/dec_dense_48/add_grad/Sum"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](dec_dense_48/Reshape_2, dec_dense_48/Reshape_3)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: loss_24/mul/_2957 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3053_loss_24/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
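As an aside, if I read the Keras 2.2 source correctly, the RunOptions mentioned in the hint can be forwarded through compile(), since extra keyword arguments are passed on to the backend training function; an untested sketch with a toy model:

    import tensorflow as tf
    from keras.models import Sequential
    from keras.layers import Dense

    # Toy model; only the compile() call below matters here.
    model = Sequential([Dense(1, input_shape=(3,))])

    # Flag from the hint above; compile() hands the options through to the
    # session runs performed by fit(), so an OOM report should then include
    # the list of allocated tensors.
    run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
    model.compile(optimizer='adam', loss='mse', options=run_opts)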
I am aware that I am handling big tensors, but I don't understand why memory usage seems to grow over time, let alone in the middle of training.
Update: The comments linked to another question, whose answer suggested using tf.get_default_graph().finalize(). I therefore finalized the graph after defining my model with the Keras functional API and before starting training. But calling model.fit() then raised RuntimeError: Graph is finalized and cannot be modified.
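Reduced to a toy model, my attempt looked roughly like this (hypothetical names and shapes); the last line is what raises the error, since Keras only builds the training ops on the first call to fit():

    import numpy as np
    import tensorflow as tf
    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    inp = Input(shape=(10, 8))
    out = Dense(8)(LSTM(16)(inp))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')

    tf.get_default_graph().finalize()

    # The optimizer's update ops do not exist yet, so fit() tries to add them
    # to the finalized graph and fails:
    # RuntimeError: Graph is finalized and cannot be modified.
    model.fit(np.zeros((4, 10, 8)), np.zeros((4, 8)), batch_size=4, epochs=1)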
So is this a bug in Keras, in that I cannot train the model without modifying the graph, or is it another issue altogether?