
I need to train a neural network model (4 GRU layers) implemented in TensorFlow. The code, which I got from another developer, was originally written as a Jupyter notebook.
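For context, the architecture is along these lines (a simplified sketch, since the real code is proprietary; the unit counts, input shape, and output head are placeholders):

```python
import tensorflow as tf

# Rough stand-in for the model: four stacked GRU layers and a binary head.
# Layer sizes and input shape are placeholders, not the real values.
model = tf.keras.Sequential([
    tf.keras.layers.GRU(128, return_sequences=True, input_shape=(None, 64)),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
```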

If I run the notebook, the model trains fine: RAM usage stays around 20% and GPU memory usage is about 11 GB.

If I copy the code into a Python file and run it, it keeps crashing, even if I reduce the batch size. In this case the RAM usage is much higher, while much less GPU memory is used (about 2.5 GB).
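A quick check that the script run actually sees the GPU the same way the notebook does (a diagnostic sketch, not part of the original code; it would go near the top of the file):

```python
import tensorflow as tf

# Confirm the GPU is visible to this process and let TensorFlow allocate
# GPU memory on demand instead of reserving it all up front.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```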

The error messages I get look like this:

2401/2402 [============================>.] - ETA: 0s - loss: 0.0866 - accuracy: 0.9831 - precision: 0.0000e+00 - recall: 0.0000e+00
2021-08-12 08:29:31.894194: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2053413600 exceeds 10% of free system memory.
[...]
Filling up shuffle buffer (this may take a while): 15082 of 19299
Killed

Do you have any suggestions on how to fix this? I would prefer to train the model using the Python file, since the time per epoch is much smaller.
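For reference, the "Filling up shuffle buffer" message comes from tf.data's `shuffle()` step, whose buffer is held in host RAM. Below is a minimal sketch of a pipeline of that shape, with synthetic stand-in data (the real loader is proprietary; all sizes are made up except the 19299 examples from the log):

```python
import tensorflow as tf

# Synthetic stand-in data; 19299 matches the shuffle-buffer size in the log,
# the other shapes are placeholders for the sketch.
num_examples, timesteps, features = 19299, 100, 64
x = tf.random.normal((num_examples, timesteps, features))
y = tf.random.uniform((num_examples,), maxval=2, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    # shuffle(num_examples) buffers the whole dataset in host RAM before the
    # first batch is produced: the "Filling up shuffle buffer" phase.
    # Capping the buffer bounds RAM usage at the cost of a weaker shuffle.
    .shuffle(2048)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```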

albus_c
  • It's hard to know where to start without a minimal working example. Any chance you can provide the `.py` file? Or a minimal working example that crashes? – TC Arlen Aug 13 '21 at 18:04
  • Unfortunately the memory usage has to do with data loading, and the data is proprietary :( – albus_c Aug 14 '21 at 08:28
  • Hi! Can you please check out the answers in the following two issues: https://github.com/tensorflow/tensorflow/issues/32376 and https://stackoverflow.com/questions/50929266/tensorflow-object-detection-api-killed-oom-how-to-reduce-shuffle-buffer-size – Aug 18 '21 at 07:15

0 Answers