The solution to my problem
Unfortunately, my question is not answered by the question this is supposed to be a duplicate of. While it is true that the graph is changed during training, calling finalize() does not fix it, because Keras is the underlying problem. The correct answer is found here: for each model, I need to call _make_predict_function() after compiling, and the model I call model.fit() on needs to be "warmed up", as described in that answer, by calling predict() before finalizing and fitting.
The answer also explains why this happens; Keras tries to save memory by building the graph as late as possible.
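To make the order of calls concrete, here is a rough sketch using a toy model in place of my real encoder-decoder; the shapes and names are made up, and Keras 2.x with the TensorFlow 1.x backend is assumed. The explicit _make_train_function() call is my own precaution rather than part of the linked answer, so that fit() does not have to add the optimizer ops to an already finalized graph:

    import numpy as np
    import tensorflow as tf
    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    # Hypothetical toy model standing in for the real encoder-decoder.
    inp = Input(shape=(10, 8))
    out = Dense(8)(LSTM(16)(inp))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')

    # Build Keras' lazily created functions now instead of during fit()/predict().
    model._make_predict_function()
    model._make_train_function()  # precaution: also creates the optimizer ops

    # Warm-up predict: runs the session once, which also initializes all
    # variables (including the optimizer slots created above).
    model.predict(np.zeros((1, 10, 8)))

    # From here on, nothing should need to modify the graph any more.
    tf.get_default_graph().finalize()

    model.fit(np.zeros((4, 10, 8)), np.zeros((4, 8)), batch_size=4, epochs=2)

With this order, everything fit() needs already exists in the graph before it is frozen, so training no longer trips over the finalized graph.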
The original question
I am training an LSTM encoder-decoder model using Keras and tensorflow-gpu, running on an Nvidia Tesla K80 (two GPU cores) on a Gentoo machine. According to nvidia-smi, no other processes are using the GPU, and TensorFlow can access the GPUs just fine. My batch size is 4.
The training runs smoothly (without any warnings whatsoever) for a while. However, after some time, an out-of-memory exception is raised and training stops mid-epoch. I don't understand how a memory error can occur during training, in the middle of an epoch and without any prior warning, as I thought that once the tensors are allocated, no additional memory would be needed.
In the past, TensorFlow notified me when it allocated more than 10% of my GPU's memory, but no such warning appeared this time.
Here is some information from when the OOM occurred (right after TensorFlow had logged every chunk in use); it happened during epoch 34 of the 50 scheduled.
2018-08-11 06:15:07.676836: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 8.43GiB
2018-08-11 06:15:07.676848: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 11286285517
InUse: 9053307136
MaxInUse: 10209047296
NumAllocs: 975991019
MaxAllocSize: 1223803648
2018-08-11 06:15:07.698578: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ******************************************_________*********************_________*******************
2018-08-11 06:15:07.815583: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "GridSearch.py", line 139, in <module>
for (models, vals) in gs:
File "../lstms/GridSearch.py", line 25, in next_function_call
yield self.function_call(**val), val
File "../lstms/modeling.py", line 168, in define_and_train
train.fit([x1, x2], y, epochs=n_epoch, callbacks=[checkpointer])
File "/usr/lib64/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
validation_steps=validation_steps)
File "/usr/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2661, in __call__
return self._call(inputs)
File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2631, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/lib64/python3.6/site-packages/tensorflow/python/client/session.py", line 1454, in __call__
self._session._session, self._handle, args, status, None)
File "/usr/lib64/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: dec_dense_48/add = Add[T=DT_FLOAT, _class=["loc:@training_24/Adam/gradients/dec_dense_48/add_grad/Sum"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](dec_dense_48/Reshape_2, dec_dense_48/Reshape_3)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: loss_24/mul/_2957 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3053_loss_24/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
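As an aside, if I read the Keras 2.2 source correctly, the RunOptions mentioned in the hint can be forwarded through compile(), since extra keyword arguments are passed on to the backend training function; an untested sketch with a toy model:

    import tensorflow as tf
    from keras.models import Sequential
    from keras.layers import Dense

    # Toy model; only the compile() call below matters here.
    model = Sequential([Dense(1, input_shape=(3,))])

    # Flag from the hint above; compile() hands the options through to the
    # session runs performed by fit(), so an OOM report should then include
    # the list of allocated tensors.
    run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
    model.compile(optimizer='adam', loss='mse', options=run_opts)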
I am aware that I am handling big tensors, but I don't understand why memory usage seems to grow over time, let alone in the middle of training.
Update: The comments linked to another question, whose answer suggested using tf.get_default_graph().finalize(). I therefore finalized the graph after defining my model with the Keras functional API and before starting training. But calling model.fit() then raised RuntimeError: Graph is finalized and cannot be modified.
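Reduced to a toy model, my attempt looked roughly like this (hypothetical names and shapes); the last line is what raises the error, since Keras only builds the training ops on the first call to fit():

    import numpy as np
    import tensorflow as tf
    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    inp = Input(shape=(10, 8))
    out = Dense(8)(LSTM(16)(inp))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')

    tf.get_default_graph().finalize()

    # The optimizer's update ops do not exist yet, so fit() tries to add them
    # to the finalized graph and fails:
    # RuntimeError: Graph is finalized and cannot be modified.
    model.fit(np.zeros((4, 10, 8)), np.zeros((4, 8)), batch_size=4, epochs=1)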
So is this a bug in Keras, in that I cannot train the model without modifying the graph, or is it another issue altogether?