I am a beginner with TensorFlow and ML, so please pardon any obvious mistakes or beginner questions.
I'm currently working on an object detection problem and run out of GPU memory when training with any batch size other than 1. See the picture for GPU and CUDA info during training.
I'm using the Faster R-CNN Inception v2 model from the TensorFlow GitHub repository.
The train.record file is 753.5 MB.
Can this problem be solved with a more efficient input pipeline, or are the models on TensorFlow's GitHub already optimized? Should I change the network architecture to reduce the number of variables? Is batch size 1 the only (or the best) option for the best accuracy?
I'm trying to learn this as best I can; if any more info is needed, please ask.
Model config:
model {
  faster_rcnn {
    num_classes: 3
    image_resizer {
      fixed_shape_resizer {
        height: 200
        width: 200
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      # grid_anchor_generator {
      #   scales: [0.25, 0.5, 1.0, 2.0, 3.0]
      #   aspect_ratios: [0.25,0.5, 1.0, 2.0]
      #   height_stride: 8
      #   width_stride: 8
      # }
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0, 3.0]
        aspect_ratios: [1.0, 2.0, 3.0]
        height: 64
        width: 64
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.01
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.4
    first_stage_max_proposals: 100
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: True
        dropout_keep_probability: 0.9
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.01
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.5
        max_detections_per_class: 20
        max_total_detections: 20
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 32
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 50000
            learning_rate: .00002
          }
          schedule {
            step: 100000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  # PATH_TO_BE_CONFIGURED: Below line needs to match location of model checkpoint: Either use checkpoint from rcnn model, or checkpoint from previously trained model on other dataset.
  fine_tune_checkpoint: "...model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  # num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {}
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered : 1.0
      min_aspect_ratio: 0.5
      max_aspect_ratio: 2
      min_area: 0.2
      max_area: 1.
    }
  }
  data_augmentation_options {
    random_distort_color {}
  }
}
# PATH_TO_BE_CONFIGURED: Need to make sure folder structure below is correct for both train-record and label_map.pbtxt
train_input_reader: {
  tf_record_input_reader {
    input_path: "...train.record"
  }
  label_map_path: ".../label_map/label_map.pbtxt"
  queue_capacity: 500
  min_after_dequeue: 250
}
# PATH_TO_BE_CONFIGURED: Make sure folder structure for eval_export, validation.record and label_map.pbtxt below are correct.
eval_config: {
  num_examples: 30
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  num_visualizations: 30
  eval_interval_secs: 600
  visualization_export_dir: "...eval_export"
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/...test.record"
  }
  label_map_path: "/...label_map.pbtxt"
  shuffle: True
  num_readers: 1
}
Error message:
Caused by op 'CropAndResize', defined at:
  File "...models/research/object_detection/model_main.py", line 103, in <module>
    tf.app.run()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "...models/research/object_detection/model_main.py", line 99, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/...models/research/object_detection/model_lib.py", line 252, in model_fn
    preprocessed_images, features[fields.InputDataFields.true_image_shape])
  File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
    self._anchors.get(), image_shape, true_image_shapes))
  File "/...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 771, in _predict_second_stage
    rpn_features_to_crop, proposal_boxes_normalized))
  File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1498, in _compute_second_stage_input_feature_maps
    (self._initial_crop_size, self._initial_crop_size))
  File "/...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/ops/gen_image_ops.py", line 390, in crop_and_resize
    extrapolation_value=extrapolation_value, name=name)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,17,17,1088] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node CropAndResize (defined at ...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:1498) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[node control_dependency (defined at ...models/research/object_detection/model_lib.py:345) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
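As a rough sanity check on the error above, the size of the tensor that failed to allocate can be worked out from the shape in the message (shape[2048,17,17,1088], type float, i.e. 4 bytes per element). The sketch below is plain Python and does not touch TensorFlow. The helper names `tensor_bytes` and `scaled_shape` are made up for illustration, and the scaling step rests on an assumption: that only the leading dimension (2048) depends on `batch_size` and grows linearly with it. It also covers this one tensor only, not the total memory the model needs.

```python
# Shape and element type copied from the OOM message:
# "OOM when allocating tensor with shape[2048,17,17,1088] and type float"
OOM_SHAPE = (2048, 17, 17, 1088)
BYTES_PER_FLOAT32 = 4
CONFIG_BATCH_SIZE = 32  # batch_size from train_config in the posted config


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of one dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def scaled_shape(shape, old_batch, new_batch):
    """Shape of the same tensor at a different batch size.

    ASSUMPTION (not confirmed by the post): only the leading dimension
    depends on the batch size, and it scales linearly with it.
    """
    per_image = shape[0] // old_batch
    return (per_image * new_batch,) + tuple(shape[1:])


if __name__ == "__main__":
    size = tensor_bytes(OOM_SHAPE)
    print(f"batch_size {CONFIG_BATCH_SIZE}: {size / 2**30:.2f} GiB for this one tensor")
    for batch in (1, 2, 4, 8, 16):
        shape = scaled_shape(OOM_SHAPE, CONFIG_BATCH_SIZE, batch)
        print(f"batch_size {batch}: shape {shape}, {tensor_bytes(shape) / 2**20:.1f} MiB")
```

At `batch_size: 32` this single CropAndResize output comes to about 2.4 GiB before any other activations, gradients, or weights are counted. Under the linear-scaling assumption the same tensor would be roughly 77 MiB at batch size 1, which is consistent with training only fitting at that size. The 1088 channels are also worth comparing against the `feature_extractor` type in the config.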