
I am a beginner with TensorFlow and ML, so pardon any obvious mistakes or noob questions.

I'm currently working on an object detection problem and am running into GPU memory-capacity issues when training with any batch size larger than 1. See the picture for GPU and CUDA info during training.

I'm using the Faster R-CNN Inception ResNet v2 model from the TensorFlow GitHub (see the feature_extractor type in the config below).

The train.record file is 753.5 MB.

Can this problem be solved with a more efficient input pipeline, or are the models on TensorFlow's GitHub already optimized? Should I change the network architecture to reduce the number of variables? Is batch size 1 the only/best option for best accuracy?

I'm trying to learn this as best I can; if any more info is needed, please ask.

Model config:

model {
  faster_rcnn {
    num_classes: 3
    image_resizer {
      fixed_shape_resizer {
        height: 200
        width: 200
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
     # grid_anchor_generator {
     #   scales: [0.25, 0.5, 1.0, 2.0, 3.0]
     #   aspect_ratios: [0.25,0.5, 1.0, 2.0]
     #   height_stride: 8
     #   width_stride: 8
     # }
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0, 3.0]
        aspect_ratios: [1.0, 2.0, 3.0]
        height: 64
        width: 64 
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.01
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.4
    first_stage_max_proposals: 100
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: True
        dropout_keep_probability: 0.9
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.01
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.5
        max_detections_per_class: 20
        max_total_detections: 20
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
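
For intuition, here is a back-of-envelope count of the first-stage anchors this config implies (a sketch only; the exact grid size depends on the feature extractor's padding and atrous settings):

image_size = 200   # fixed_shape_resizer above
stride = 8         # first_stage_features_stride
scales = [0.25, 0.5, 1.0, 2.0, 3.0]
aspect_ratios = [1.0, 2.0, 3.0]

grid = image_size // stride                              # ~25x25 feature-map locations
anchors_per_location = len(scales) * len(aspect_ratios)  # 15
print(grid * grid * anchors_per_location)                # ~9375 first-stage anchors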

train_config: {
  batch_size: 32
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 50000
            learning_rate: .00002
          }
          schedule {
            step: 100000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0


# PATH_TO_BE_CONFIGURED: The line below must point to a model checkpoint: either the checkpoint shipped with the Faster R-CNN model, or a checkpoint from a model previously trained on another dataset.
  fine_tune_checkpoint: "...model.ckpt"

  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  # num_steps: 200000

  data_augmentation_options {
    random_horizontal_flip {}
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 1.0
      min_aspect_ratio: 0.5
      max_aspect_ratio: 2.0
      min_area: 0.2
      max_area: 1.0
    }
  }
  data_augmentation_options {
    random_distort_color {}
  }
}
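
For intuition on the augmentation options above: random_horizontal_flip corresponds roughly to tf.image.random_flip_left_right applied during preprocessing (the Object Detection API also flips the ground-truth boxes to match). A minimal standalone sketch of the image part only:

import tensorflow as tf

# Illustrative stand-in for the input image; [height, width, channels].
image = tf.zeros([200, 200, 3])
# 50% chance of mirroring the image left-right each time it is evaluated.
flipped = tf.image.random_flip_left_right(image)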



# PATH_TO_BE_CONFIGURED: Make sure the folder structure below is correct for both train.record and label_map.pbtxt.
train_input_reader: {
  tf_record_input_reader {
    input_path: "...train.record"
  }
  label_map_path: ".../label_map/label_map.pbtxt"
  queue_capacity: 500
  min_after_dequeue: 250
}
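
The queue_capacity and min_after_dequeue settings above belong to the older queue-based input pipeline. For comparison, a rough tf.data sketch of reading the same TFRecord (the buffer sizes and helper name are illustrative assumptions, not the Object Detection API's internals, which are built from the config above):

import tensorflow as tf

def make_dataset(record_path, batch_size=1):
    dataset = tf.data.TFRecordDataset(record_path)
    dataset = dataset.shuffle(buffer_size=250)  # analogous to min_after_dequeue
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    # Prefetch overlaps input I/O with GPU compute.
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset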



# PATH_TO_BE_CONFIGURED: Make sure the folder structure for eval_export, validation.record, and label_map.pbtxt below is correct.
eval_config: {
  num_examples: 30
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  num_visualizations: 30
  eval_interval_secs: 600
  visualization_export_dir: "...eval_export"
}



eval_input_reader: {
  tf_record_input_reader {
    input_path: "/...test.record"
  }
  label_map_path: "/...label_map.pbtxt"
  shuffle: True
  num_readers: 1
}

Error message:

Caused by op 'CropAndResize', defined at:
  File "...models/research/object_detection/model_main.py", line 103, in <module>
    tf.app.run()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "...models/research/object_detection/model_main.py", line 99, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/...models/research/object_detection/model_lib.py", line 252, in model_fn
    preprocessed_images, features[fields.InputDataFields.true_image_shape])
  File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
    self._anchors.get(), image_shape, true_image_shapes))
  File "/...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 771, in _predict_second_stage
    rpn_features_to_crop, proposal_boxes_normalized))
  File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1498, in _compute_second_stage_input_feature_maps
    (self._initial_crop_size, self._initial_crop_size))
  File "/...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/ops/gen_image_ops.py", line 390, in crop_and_resize
    extrapolation_value=extrapolation_value, name=name)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,17,17,1088] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node CropAndResize (defined at ...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:1498) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[node control_dependency (defined at ...models/research/object_detection/model_lib.py:345) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
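
Following the hint in the message, report_tensor_allocations_upon_oom is passed through RunOptions. A minimal sketch for a raw session (the Estimator loop in model_main.py does not expose this directly, so this is illustration only; sess and fetches stand in for an existing session and graph outputs):

import tensorflow as tf  # TF 1.x API, matching the traceback above

# Request a dump of allocated tensors if an OOM occurs during this run call.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# sess.run(fetches, options=run_options)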
  • The batch size is mostly irrelevant to how accurate your model will be. It is, however, generally significantly faster to train a model by making the batch size as large as possible. I'm not sure what your code is, so I can't assist much. However, from the error, a `[2048,17,17,1088]` float tensor occupies 2048 * 17 * 17 * 1088 * 4 bytes per 32-bit float ≈ 2.4 GB (see the back-of-envelope check after these comments). That's a quarter of your memory. The 17s may come from your grid anchor generator: (200 - 64)/8 = 17, i.e. (img_size - grid_size)/stride. I'm not familiar enough with the exact workings of the algorithm to comment on the rest. – Him Jul 02 '19 at 14:26
  • @Scott You're right that too small a batch size can make the overhead of transferring data to the GPU noticeable. However, above a certain batch size there's not really a noticeable difference. Smaller batch sizes can make the model converge much faster, though, due to the stochastic element, so the optimal batch size will likely change depending on your model. But in general you might want to aim for as small a batch size as possible without being so small that you suffer a performance loss from overhead. – user2653663 Sep 06 '19 at 16:29
  • @Him That is not totally valid in practice, since the gradient for a very small batch size may be too noisy. Some works cope with that (https://par.nsf.gov/servlets/purl/10108626), but it remains a struggle to train Mask R-CNN on COCO (that is the question I am still looking for an answer to). – A.Ametov Jan 09 '21 at 12:54
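
A back-of-envelope check of the tensor size quoted in the first comment (the 17x17 spatial size matches the initial_crop_size that CropAndResize uses in the traceback above):

# OOM tensor shape [num_boxes, crop_height, crop_width, channels] at 4 bytes per float32.
num_bytes = 2048 * 17 * 17 * 1088 * 4
print(num_bytes / 2**30)  # ~2.4 GiB for this single tensor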

1 Answer


I think you should change the batch_size line in your train_config to: batch_size: 1