
I'm trying to run several different ML architectures,
all vanilla, without any modification (git clone -> python train.py),
but the result is always the same: a segmentation fault, or Resource exhausted: OOM when allocating tensor.
When running on the CPU only, the program finishes successfully.
I'm running the session with

    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.33
    config.gpu_options.allow_growth = True
    config.allow_soft_placement = True
    config.log_device_placement = True
    sess = tf.Session(config=config)

And yet, the result is

    2019-03-11 20:23:26.845851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************************************x**********____**********____**_____*
    2019-03-11 20:23:26.845885: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    Traceback (most recent call last):

    2019-03-11 20:23:16.841149: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
    2019-03-11 20:23:16.841191: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
    2019-03-11 20:23:26.841486: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 640.00MiB.  Current allocation summary follows.
    2019-03-11 20:23:26.841566: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 195, Chunks in use: 195. 48.8KiB allocated for chunks. 48.8KiB in use in bin. 23.3KiB client-requested in use in bin.

    ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node transform_net1/tconv2/bn/moments/SquaredDifference (defined at /home/dvir/CLionProjects/gml/Dvir/FlexKernels/utils/tf_util.py:504)  = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transform_net1/tconv2/BiasAdd, transform_net1/tconv2/bn/moments/mean)]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
         [[{{node div/_113}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1730_div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
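
Following the hint in the log, this is only a sketch of how that option would be passed in TF 1.x, reusing the `sess` from the snippet above; `train_op` and `feed_dict` are placeholders for whatever the training script actually runs:

    # Ask TF to report live tensor allocations when an OOM occurs
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
    sess.run(train_op, feed_dict=feed_dict, options=run_options)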

I'm running with

    tensorflow-gpu 1.12
    tensorflow 1.13

and the GPU is a GeForce RTX 2080 Ti.

The model is Dynamic Graph CNN for Learning on Point Clouds, and it was tested successfully on another machine with a 1080 Ti.

DsCpp
  • Looks like you are running out of GPU memory. What does your model look like? – FlyingTeller Mar 12 '19 at 11:22
  • Might be that your GPU resources have been reserved by another process (e.g. TensorFlow session that hasn't been closed). What do you get when you type `nvidia-smi` in the shell? – Karl Mar 12 '19 at 12:32
  • @DsCpp I'm experiencing something similar. I'm currently using CUDA 10.0 and cuDNN 7.5. You've mentioned that upgrading your drivers fixed the issue. What CUDA/cuDNN versions are you using? – Jed Aug 09 '19 at 07:00
  • @Jed I'm currently on CUDA 10 and cuDNN 7.6, but I ended up realizing that my model was just too big for my hardware; after reducing the batch size and implementing it as a multi-GPU model, the OOM stopped. – DsCpp Aug 10 '19 at 08:29
  • @DsCpp thanks for following up. I'm currently on 10 and 7.5, recently upgraded from 7.2, and things are better with 7.5, but I'm occasionally still getting OOM, regardless of batch size and model size. I'm running TF 2.0 beta1 though, so perhaps there is a memory issue that hasn't been resolved. – Jed Aug 10 '19 at 09:40
  • Are you using non-trivial layers? I've encountered bad performance with the Cholesky decomposition layer, but with the "ordinary" ones things should be stable. – DsCpp Aug 10 '19 at 15:12

2 Answers


For TensorFlow 2.2.0, this script works:

    import tensorflow as tf

    physical_devices = tf.config.list_physical_devices('GPU')
    if physical_devices:
        # Let allocation grow as needed, but cap the first GPU at 4000 MB
        tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
        tf.config.experimental.set_virtual_device_configuration(
            physical_devices[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4000)])

https://stackoverflow.com/a/63123354/5884380
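
A quick way to confirm the cap took effect (this check is not part of the linked answer, just the standard TF 2.x API):

    import tensorflow as tf

    # After the configuration above, the capped virtual device
    # should appear as a logical GPU
    print(tf.config.list_logical_devices('GPU'))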

Sadidul Islam

As explained here, the line `config.gpu_options.per_process_gpu_memory_fraction = 0.33` determines the fraction of each visible GPU's overall memory that should be allocated (33% in your case). Increasing this value, or removing the line entirely (which allows 100%), will make more of the needed memory available.
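
For example, a minimal TF 1.x sketch of that change (the 0.9 value is only an illustration, not something taken from the question):

    import tensorflow as tf

    config = tf.ConfigProto()
    # Let this process allocate up to ~90% of each visible GPU
    config.gpu_options.per_process_gpu_memory_fraction = 0.9
    sess = tf.Session(config=config)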

D_negn
  • Unfortunately the problem was probably in my CUDA drivers. After reinstalling Ubuntu and the drivers, everything is working. (Setting the fraction to 1 didn't help, as it allocated 100% of the available space no matter what.) – DsCpp Mar 13 '19 at 07:45
  • Glad it is working, but I think setting the fraction should have an effect. Maybe you can experiment now with setting it to different values, since the problem is solved, and see the effect. – D_negn Mar 13 '19 at 09:14
  • As the model is relatively small, with 11 GB of memory (2080 Ti) even per_process_gpu_memory_fraction=0.1 was sufficient. I think a nasty bug in the CUDA drivers, along with bad NVIDIA drivers, caused it to allocate all the memory right away, no matter what job it tried to run. – DsCpp Mar 13 '19 at 12:32