
I've been following this tutorial on the Tensorflow Object Detection API, and I've successfully trained my own object detection model using Google's Cloud TPUs.

However, the problem is that on Tensorboard, the plots I'm seeing only have 2 data points each (so it just plots a straight line), like this:

[screenshot: plot with only 2 data points]

...whereas I want to see more "granular" plots like these below, which are much more detailed:

[screenshot: more detailed plots with many data points]

The tutorial I've been following acknowledges that this issue is caused by the fact that TPU training completes in very few steps:

Note that these graphs only have 2 points plotted since the model trains quickly in very few steps (if you’ve used TensorBoard before you may be used to seeing more of a curve here)

I tried adding save_checkpoints_steps=50 in the file model_tpu_main.py (see code fragment below), and when I re-ran training, I was able to get a more granular plot, with 1 data point every 300 steps or so.

config = tf.contrib.tpu.RunConfig(
      # I added this line below:
      save_checkpoints_steps=50,

      master=tpu_grpc_url,
      evaluation_master=tpu_grpc_url,
      model_dir=FLAGS.model_dir,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          num_shards=FLAGS.num_shards))

[screenshot: plot with a data point roughly every 300 steps]

However, my training job is actually saving a checkpoint every 100 steps, rather than every 300 steps. Looking at the logs, my evaluation job is running every 300 steps. Is there a way I can make my evaluation job run every 100 steps (whenever there's a new checkpoint) so that I can get more granular plots on Tensorboard?

Auberon López
Horace Lee

3 Answers


Code that addresses this issue is explained by a technical lead for Google Cloud Platform in a Medium blog post. Alternatively, go directly to the GitHub code.

The 81-line train_and_evaluate function defines a TPUEstimator, a train_input_fn and an eval_input_fn. It then loops over the training steps and calls estimator.train and estimator.evaluate in each iteration. The metrics can be defined in the model_fn, which in that code is called image_classifier. Note that adding tf.summary calls to the model function currently has no effect, since the TPU does not support custom summaries:

"TensorBoard summaries are a great way see inside your model. A minimal set of basic summaries are automatically recorded by the TPUEstimator, to event files in the model_dir. Custom summaries, however, are currently unsupported when training on a Cloud TPU. So while the TPUEstimator will still run locally with summaries, it will fail if used on a TPU." (source)

If summaries are important, it might be more convenient to switch to training on a GPU.

Personally I think writing this code is quite a hassle for something which should be handled by the API. Please update this answer if better solutions exist! I'm looking forward to it.

RikH
  • Thanks for your answer. At the moment I'm still a beginner at Tensorflow so I didn't entirely understand what you were talking about, and also I'm not too bothered to solve this issue for now, but I hope that other people can upvote your answer if they find it useful. – Horace Lee Dec 03 '18 at 01:31

Set save_summary_steps in RunConfig to 100 so that you get the statistics you want. Also set iterations_per_loop to 100 so that training doesn't run more steps than that at a time.
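
Applied to the RunConfig from the question, that suggestion would look roughly like the sketch below (the hard-coded 100 for iterations_per_loop could equally be passed through the existing --iterations_per_loop flag):

config = tf.contrib.tpu.RunConfig(
    save_checkpoints_steps=50,
    save_summary_steps=100,        # write summary events every 100 steps
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=100,   # return to the host at least every 100 steps
        num_shards=FLAGS.num_shards))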

p.s. I hope you realize that checkpointing is very slow. You are probably raising the cost of your job just for the sake of a pretty graph :)

Lak

You can try adding throttle_secs=100 to the EvalSpec constructor here. The default is 600 seconds.
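
As a sketch, the change would look something like this at the point where the EvalSpec is constructed (the other arguments stay as whatever the script already passes; the names below are placeholders):

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,   # the script's existing eval input function
    steps=None,               # keep whatever the script already uses here
    throttle_secs=100)        # allow re-evaluation after 100s instead of the default 600s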

Zhichao Lu
  • Thanks. I'll try it and get back to you later. – Horace Lee Aug 31 '18 at 08:01
  • I tried `throttle_secs=100` and it didn't work (my evaluation job still performs an evaluation every 300 steps). However I'll try a lower value like `throttle_secs=60` and let you know the result. – Horace Lee Sep 03 '18 at 06:40
  • Could you try also adding the `save_checkpoints_secs=100` keyword argument to RunConfig here: https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py#L55 – Zhichao Lu Sep 04 '18 at 07:17
  • It's already using `save_checkpoints_steps=50` (though it's `steps` instead of `secs`) – Horace Lee Sep 17 '18 at 08:07
  • Also I tried `throttle_secs=60` and the evaluation job still performs an evaluation every 300 steps – Horace Lee Sep 17 '18 at 16:49