
System information

  • OS Platform and Distribution: CentOS 7.5.1804
  • TensorFlow installed from: pip install tensorflow-gpu
  • TensorFlow version: tensorflow-gpu 1.8.0
  • CUDA/cuDNN version: 9.0/7.1.2
  • GPU model and memory: GeForce GTX 1080 Ti, 11264MB
  • Exact command to reproduce:

    python train.py --logtostderr --train_dir=./models/train --pipeline_config_path=mask_rcnn_inception_v2_coco.config

Describe the problem

I am attempting to train a Mask R-CNN model on my own dataset (fine-tuning from a model trained on COCO), but the process is killed as soon as the shuffle buffer is filled.

Before this happens, nvidia-smi shows memory usage of around 10669MB/11175MB but only 1% GPU utilisation.

I have tried adjusting the following train_config settings:

batch_size: 1    
batch_queue_capacity: 10    
num_batch_queue_threads: 4    
prefetch_queue_capacity: 5

And for train_input_reader:

num_readers: 1
queue_capacity: 10
min_after_dequeue: 5

I believe my problem is similar to "TensorFlow Object Detection API - Out of Memory", but I am using a GPU rather than CPU-only.

The images I am training on are comparatively large (2048×2048), but I would like to avoid downsizing because the objects to be detected are quite small. My training set consists of 400 images (in a .tfrecord file).

Is there a way to reduce the size of the shuffle buffer to see if this reduces the memory requirement?

Traceback

INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2018-06-19 12:21:33.487840: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 97 of 2048
2018-06-19 12:21:43.547326: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 231 of 2048
2018-06-19 12:21:53.470634: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 381 of 2048
2018-06-19 12:21:57.030494: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
Killed
dpaddon
    The shuffle buffer is on the CPU, bounded by RAM (but there is swapping). Your inputs are simply too big for training. The maximum size a decent GPU can handle during training is somewhere around 1300 x 1300. GPU utilization is not your issue. – Patwie Jun 19 '18 at 16:20
    Thanks. Reducing the max_dimension parameter in the config from 1365 to 900 solved the OOM issue. However, GPU utilisation is still showing as 0% (or single-digit). Surely this isn't the expected behaviour? – dpaddon Jun 20 '18 at 12:56
  • Do you really expect an answer for that without showing any relevant code? – Patwie Jun 20 '18 at 12:58
  • I am using the same command as above to run the train.py file to train my model, which is no longer killed as soon as the buffer is filled. However, during training, nvidia-smi shows GPU-Util of 0% while memory usage is 11680MiB / 12212MiB. – dpaddon Jun 20 '18 at 13:50

3 Answers


You can try the following steps:

1. Set batch_size = 1 (or try your own value).

2. Change the default value of shuffle_buffer_size (or try your own) in

models/research/object_detection/protos/input_reader.proto

(line 40 at commit ce03903). The original setting is:

optional uint32 shuffle_buffer_size = 11 [default = 2048];

The default of 2048 is too big for batch_size = 1 and, in my opinion, consumes a lot of RAM; it should be reduced accordingly, for example to:

optional uint32 shuffle_buffer_size = 11 [default = 256];

3. Recompile the Protobuf libraries. From tensorflow/models/research/:

protoc object_detection/protos/*.proto --python_out=.
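
As a quick sanity check (my own addition, not part of the original answer), you can confirm that the regenerated Python protos picked up the new default:

from object_detection.protos import input_reader_pb2

# A fresh InputReader message reports the proto's default for unset fields,
# so this should print 256 once the edit and recompile have taken effect.
print(input_reader_pb2.InputReader().shuffle_buffer_size)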
liuchangf
  • I am also experiencing this issue. What do you suggest I change `optional uint32 shuffle_buffer_size = 11 [default = 2048]` to? – Alexis Winters Oct 03 '19 at 17:01
  • Change `optional uint32 shuffle_buffer_size = 11 [default = 2048]` to `optional uint32 shuffle_buffer_size = 11 [default = 256]`; if `256` does not meet your requirements, you can adjust it. – liuchangf Oct 08 '19 at 03:13

In your pipeline.config, add

shuffle_buffer_size: 200

(or a value that suits your system) to the train_input_reader:

train_input_reader {
  shuffle_buffer_size: 200
  label_map_path: "tfrecords/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "tfrecords/train.record"
  }
}

This works for me, tested on both TF1 and TF2.
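
If you prefer to set this programmatically rather than editing pipeline.config by hand, here is a hedged sketch using the Object Detection API's config_util helpers (the paths are placeholders and this workflow is my assumption, not part of the answer above):

from object_detection.utils import config_util

# Load the existing pipeline config into a dict of proto messages.
configs = config_util.get_configs_from_pipeline_file("pipeline.config")

# Lower the input reader's shuffle buffer, mirroring the manual edit above.
configs["train_input_config"].shuffle_buffer_size = 200

# Reassemble and write pipeline.config back out (here, to the current directory).
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "./")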

Deepak Raj

I changed flow_from_directory to flow_from_dataframe, because it doesn't load the pixel data of all the images into memory at once.
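
For illustration, a minimal Keras sketch of that switch (the CSV file, column names, and image folder are placeholders I've assumed, not taken from the answer):

import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The dataframe only holds filenames and labels; image pixels are read from
# disk in batches as training proceeds rather than being loaded up front.
df = pd.read_csv("labels.csv")  # placeholder: columns "filename" and "class"

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_dataframe(
    dataframe=df,
    directory="images/",      # placeholder folder containing the image files
    x_col="filename",
    y_col="class",
    target_size=(256, 256),
    batch_size=32,
    class_mode="categorical",
)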

zhai