
I'm trying to retrain a Faster R-CNN object detection model with just one class, and I've attempted to run it both locally (on a VM) and through ML Engine. I keep running into the same error regarding the train_config, which comes from my adaptation of the faster_rcnn_resnet50_coco.config configuration:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 171, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 142, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 637, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 674, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1234, in _train_model_default
    input_fn, model_fn_lib.ModeKeys.TRAIN))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1075, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1162, in _call_input_fn
    return input_fn(**kwargs)
  File "/root/.local/lib/python2.7/site-packages/trainer/object_detection/inputs.py", line 375, in _train_input_fn
    raise TypeError('For training mode, the train_config must be a '
TypeError: For training mode, the train_config must be a train_pb2.TrainConfig.

I've spent a long time looking for the potential cause of this issue in my config file but I can't see what the problem is. There doesn't seem to be any documentation mentioning this apart from the TF source code itself. Any insight would be greatly appreciated!

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 600
        width: 205
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet50'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 5
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0003
          decay_steps: 500
          decay_factor: 0.9
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://ml-pipeline/checkpoints/fast_rcnn_resnet50/model.ckpt-5500"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  num_steps: 2000
  data_augmentation_options {
    normalize_image {
    }
    random_pixel_value_scale {
    }
    random_adjust_brightness {
    }
    random_jitter_boxes {
    }
    random_pad_image {
    }
  }
  max_number_of_boxes: 35
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ml-pipeline/data/tf-records/train.record"
  }
  label_map_path: "gs://ml-pipeline/story_label_map.pbtxt"
}
eval_config {
  num_examples: 54
  num_visualizations: 54
  eval_interval_secs: 10
  max_evals: 1
  #use_moving_averages: false
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ml-pipeline/data/tf-records/test.record"
  }
  label_map_path: "gs://ml-pipeline/story_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Bobbi

1 Answer


I didn't see anything obviously wrong here. Could you do this for debugging:

add print type(configs['train_config']) and print configs['train_config'] to here

and let me know what is printed?
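
A minimal sketch of a slightly richer version of that check, in case it helps. The helper name `describe_config` and the stand-in class are hypothetical and not part of the Object Detection API; in the real code the arguments would presumably be `configs['train_config']` and `train_pb2.TrainConfig`. Besides the type, it reports which file each class was loaded from and whether `isinstance` succeeds:

```python
import sys


def describe_config(obj, expected_cls):
    """Collect facts about obj's type relative to expected_cls.

    Hypothetical debugging helper, not part of any library.
    """
    actual_cls = type(obj)

    def source_file(cls):
        # File of the module that defined the class, if it has one.
        module = sys.modules.get(cls.__module__)
        return getattr(module, "__file__", None)

    return {
        "actual_type": "%s.%s" % (actual_cls.__module__, actual_cls.__name__),
        "expected_type": "%s.%s" % (expected_cls.__module__, expected_cls.__name__),
        "actual_file": source_file(actual_cls),
        "expected_file": source_file(expected_cls),
        "same_class_object": actual_cls is expected_cls,
        "isinstance": isinstance(obj, expected_cls),
    }


class TrainConfigStandIn(object):
    """Stand-in for train_pb2.TrainConfig so the sketch runs on its own."""


report = describe_config(TrainConfigStandIn(), TrainConfigStandIn)
for key in sorted(report):
    print("%s = %s" % (key, report[key]))
```

If the two type names look alike but the files or the `same_class_object` flag differ, the object and the expected class come from different module objects.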

Zhichao Lu
  • You should also rebuild the protos using this command (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md#protobuf-compilation). – Zhichao Lu Mar 21 '19 at 19:21
  • Thanks for the reply, I tried both of these and got this output: – Bobbi Mar 27 '19 at 11:28
  • batch_size: 5 data_augmentation_options { random_pad_image { } } optimizer { momentum_optimizer { learning_rate { exponential_decay_learning_rate { initial_learning_rate: 0.000300000014249 decay_steps: 500 decay_factor: 0.899999976158 } } momentum_optimizer_value: 0.899999976158 } use_moving_average: false } gradient_clipping_by_norm: 10.0 fine_tune_checkpoint: "gs://sensible-ml-pipeline/checkpoints/fast_rcnn_resnet50/model.ckpt" from_detection_checkpoint: true num_steps: 2000 – Bobbi Mar 27 '19 at 11:29
  • load_all_detection_checkpoint_vars: true max_number_of_boxes: 35 This seems odd as this seems to be of the correct type, unless I'm mistaken! – Bobbi Mar 27 '19 at 11:31
  • Indeed, this is odd. Could you sync to HEAD, rebuild the protos, and try again? If the type matches, we shouldn't enter that if branch. Could you also print the value of isinstance(train_config, train_pb2.TrainConfig)? – Zhichao Lu Mar 28 '19 at 16:51
  • I reverted my git repository back to the original and rebuilt the protobuf compiler manually before compiling again, but I'm still running into this error strangely! With print( type(config) ) I got: `` and print (train_config, train_pb2.TrainConfig) resulted in the same output: – Bobbi Mar 29 '19 at 10:19
  • `batch_size: 5 data_augmentation_options { random_pad_image { } }optimizer { momentum_optimizer { learning_rate { exponential_decay_learning_rate { initial_learning_rate: 0.000300000014249 decay_steps: 500 decay_factor: 0.899999976158 } } momentum_optimizer_value: 0.899999976158 } use_moving_average: false } gradient_clipping_by_norm: 10.0 fine_tune_checkpoint: "gs://ml-pipeline/checkpoints/fast_rcnn_resnet50/model.ckpt" from_detection_checkpoint: true num_steps: 2000 load_all_detection_checkpoint_vars: true max_number_of_boxes: 35` – Bobbi Mar 29 '19 at 10:20
  • `` The config type seems to match with both print statements so I'm still confused, thanks for your help with this! – Bobbi Mar 29 '19 at 10:21
  • In this case you should not enter line 483 (https://github.com/tensorflow/models/blob/master/research/object_detection/inputs.py#L483). Could you please print the value of isinstance(train_config, train_pb2.TrainConfig) ? – Zhichao Lu Apr 01 '19 at 05:35
  • Strangely enough this is returning False. Do you think migrating our codebase to TF v2 would help overcome this issue? – Bobbi Apr 01 '19 at 09:04
  • No I don't think so. OD API doesn't support tf v2 yet. Could you share the latest error message? – Zhichao Lu Apr 02 '19 at 16:48
  • I'm still getting the same error message with the config file. However, when I try `print(type(train_pb2.TrainConfig))` I get `google.protobuf.pyext.cpp_message.GeneratedProtocolMessageType`. Is this right? – Bobbi Apr 08 '19 at 10:17
  • I'm thinking there could possibly be an issue with the protobuf compiler? I've manually rebuilt the protos using protoc 3.0, which I believe is the correct version for Linux, and when I do `pip show protobuf` I'm using the protobuf version 3.6.1. The thing is I'm running my script on a pre-built deep learning VM, so I built Tensorflow from one of the binaries that were pre-installed for the particular image. Do you think I could be using an incompatible version of TF for the protobuf compiler? – Bobbi Apr 08 '19 at 10:43
  • "train_config" should be a train_pb2.TrainConfig object, while "train_pb2.TrainConfig" should be a proto message, so that is expected. – Zhichao Lu Apr 08 '19 at 17:40
  • For now I think you could do two things: 1. Clean up your existing protobuf and reinstall it following this doc (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md#manual-protobuf-compiler-installation-and-usage). 2. Above line 482 here (https://github.com/tensorflow/models/blob/master/research/object_detection/inputs.py#L482), add print(train_config), print(type(train_config)), and print(isinstance(train_config, train_pb2.TrainConfig)), then let me know the outputs, including any error messages. You can update your question with that info. – Zhichao Lu Apr 08 '19 at 17:45
  • I uninstalled the pip version of protobuf and reinstalled according to the documentation. However, I'm now getting the following traceback: – Bobbi Apr 09 '19 at 13:44
  • File "trainer/task.py", line 24, in import tensorflow as tf File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 59, in from tensorflow.core.framework.graph_pb2 import * File "/usr/local/lib/python2.7/dist-packages/tensorflow/core/framework/graph_pb2.py", line 15, in from tensorflow.core.framework import node_def_pb2 as – Bobbi Apr 09 '19 at 13:45
  • tensorflow_dot_core_dot_framework_dot_node__def__pb2 File "/usr/local/lib/python2.7/dist-packages/tensorflow/core/framework/node_def_pb2.py", line 15, in from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2 File "/usr/local/lib/python2.7/dist-packages/tensorflow/core/framework/attr_value_pb2.py", line 15, in from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2 File "/usr/local/lib/python2.7/dist-packages/tensorflow/core/framework/tensor_pb2.py" – Bobbi Apr 09 '19 at 13:46
  • line 15, in from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2 File "/usr/local/lib/python2.7/dist-packages/tensorflow/core/framework/resource_handle_pb2.py", line 22, in – Bobbi Apr 09 '19 at 13:50
  • serialized_pb=_b('\n/tensorflow/core/framework/resource_handle.proto\x12\ntensorflow\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tBn\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01Z=github.com/tensorflow/tensorflow/tensorflow/go/core/framework\xf8\x01\x01\x62\x06proto3') TypeError: __new__() got an unexpected keyword argument 'serialized_options' – Bobbi Apr 09 '19 at 13:50
  • Can you file an issue on GitHub and provide everything here plus the requested "System information"? It's hard to discuss here. Regarding the proto issue, could you try uninstalling it and reinstalling v3.4? – Zhichao Lu Apr 10 '19 at 17:57
  • Thanks, I've submitted the form to Github now, though I wasn't sure if this was the right place to file it so sorry if it isn't! Thanks so much for your help! https://github.com/tensorflow/tensorflow/issues/27748 – Bobbi Apr 11 '19 at 09:25
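
The comments above report a type that prints as TrainConfig while isinstance(train_config, train_pb2.TrainConfig) returns False. One general way Python can produce that combination is when the same source file is loaded as two separate module objects, for example under two different package paths. The traceback in the question shows inputs.py living under trainer/object_detection, but whether a duplicate import is what happened here is not established in the thread. The sketch below only illustrates the mechanism with a plain class; the module names are hypothetical and no real protobuf code is involved:

```python
import types

# Source text standing in for a generated module; a plain class, not a real proto.
MODULE_SOURCE = """
class TrainConfig(object):
    pass
"""


def load_copy(module_name):
    """Execute MODULE_SOURCE as a fresh module object named module_name."""
    module = types.ModuleType(module_name)
    exec(MODULE_SOURCE, module.__dict__)
    return module


# Two hypothetical import paths for what is textually the same file.
copy_a = load_copy("object_detection.protos.train_pb2")
copy_b = load_copy("trainer.object_detection.protos.train_pb2")

config = copy_a.TrainConfig()

same_name = type(config).__name__ == copy_b.TrainConfig.__name__
matches_a = isinstance(config, copy_a.TrainConfig)
matches_b = isinstance(config, copy_b.TrainConfig)

print("same class name: %s" % same_name)
print("isinstance against copy A: %s" % matches_a)
print("isinstance against copy B: %s" % matches_b)
```

The class names are identical, yet the instance only passes isinstance against the copy it was created from. Comparing the `__module__` of type(train_config) with that of train_pb2.TrainConfig would show whether two copies are in play.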