Queries regarding checkpoints of Object Detection API

Question

I have a few queries regarding the Tensorflow Object Detection API.

1. While training, only the 5 most recent checkpoints are stored. I want to store more than that, say the 10 most recent. How can this be done? (I think it should be one of the parameters of train.proto in object_detection/protos.)
2. By default, checkpoints are saved every 10 minutes (600 seconds). To change this frequency, I believe one of these two parameters has to be changed; please confirm which one:

   from learning.py in /home/user/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim

   save_summaries_secs=600 or

   save_interval_secs=600
3. While training my model (ssd_mobilenet_v2_coco_2018_03_29), I also run the evaluation simultaneously. The latest checkpoint shown in the eval graph always lags behind the latest one saved in the object_detection/training folder. For example, in my run the latest checkpoint shown on the graph is 29.437k, while the model has already been trained up to checkpoint 32.891k (and saved in the training folder). What is the reason for this 20-minute lag? Why isn't one save interval (10 minutes) enough to evaluate the trained model?

I believe this post should work for changing keep_checkpoint_every_n_hours. — Srinivas Bringu, Aug 25 '18 at 22:02
For the second point, this solution worked for me: https://github.com/tensorflow/models/issues/5139#issuecomment-418963839. For example, to save the model every 1000 steps, change the line mentioned in the linked solution from "config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)" to "config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=1000)". — hafiz031, Jul 28 '20 at 03:30
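To compare a step-based interval like the one in the comment above with the default time-based one, a rough conversion can help. The sketch below is plain Python, not Object Detection API code. It assumes the two step counts from the question (29.437k and 32.891k) really are 20 minutes apart and that training speed is constant; the function name steps_per_second is made up for illustration.

```python
def steps_per_second(step_a, step_b, elapsed_secs):
    """Average training throughput between two global steps."""
    return (step_b - step_a) / elapsed_secs


# Figures taken from the question: steps 29437 -> 32891 over about 20 minutes.
rate = steps_per_second(29437, 32891, 20 * 60)

# Approximate wall-clock time covered by save_checkpoints_steps=1000.
secs_per_1000_steps = 1000 / rate

print(f"{rate:.2f} steps/s, 1000 steps is about {secs_per_1000_steps:.0f} s")
```

Under those assumptions, saving every 1000 steps would write checkpoints more often than the default 600-second interval, so the right value depends on how fast your own run progresses.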

junbong jang · Answer 1 · 2020-07-17T16:12:44.447

This is for anyone who wants to configure the updated Object Detection API that supports TensorFlow 2.

To keep the 10 most recent checkpoints, open model_lib.py and pass the keyword argument max_to_keep=10 to every tf.train.Saver call.
To change the save frequency from 600 seconds to 3600 seconds (1 hour), open model_main.py, find the line in the main function that contains tf.estimator.RunConfig, and pass it the keyword argument save_checkpoints_secs=3600.

Here is the code snippet after configuring checkpoint save frequency in model_main.py:

def main(unused_argv):
    flags.mark_flag_as_required('model_dir')
    flags.mark_flag_as_required('pipeline_config_path')
    # Save a checkpoint every 3600 seconds (1 hour) instead of every 600 seconds.
    config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_secs=3600)

Please note that tf.estimator.RunConfig also has a keep_checkpoint_max parameter, but setting it did not affect the number of saved checkpoints for me.
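To see how the save interval and the number of kept checkpoints interact, here is a minimal pure-Python sketch. It is not Object Detection API code. It assumes a checkpoint is written exactly every save_interval_secs and that only the max_to_keep newest ones stay on disk; the function name retained_checkpoints is hypothetical.

```python
from collections import deque


def retained_checkpoints(train_secs, save_interval_secs, max_to_keep=5):
    """Return the save times (in seconds) of checkpoints still on disk.

    Simplified model: one save per interval, oldest checkpoint dropped
    once more than max_to_keep exist.
    """
    kept = deque(maxlen=max_to_keep)  # deque discards the oldest entry itself
    for t in range(save_interval_secs, train_secs + 1, save_interval_secs):
        kept.append(t)
    return list(kept)


two_hours = 2 * 3600
default_keep = retained_checkpoints(two_hours, 600)               # 5 kept
larger_keep = retained_checkpoints(two_hours, 600, max_to_keep=10)  # 10 kept

print(default_keep)
print(larger_keep)
```

With a 600-second interval, 5 checkpoints cover only the last 40 minutes or so of training, while 10 checkpoints at a 3600-second interval would span several hours, which is worth keeping in mind when choosing both values.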

score 0 · Answer 2 · answered Aug 25 '18 at 22:03

I believe this post should work; it covers changing keep_checkpoint_every_n_hours and max_to_keep:

How to store best models checkpoints, not only newest 5, in Tensorflow Object Detection API?

You can also refer to the official documentation: https://www.tensorflow.org/api_docs/python/tf/train/Saver
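As I understand the Saver documentation, keep_checkpoint_every_n_hours works on top of max_to_keep: the newest max_to_keep checkpoints are always kept, and in addition roughly one checkpoint per N hours is preserved instead of being deleted. The sketch below is a simplified pure-Python model of that behaviour, not the library's code; the function name simulate_saver and the exact rule for when a checkpoint is preserved are assumptions made for illustration.

```python
def simulate_saver(save_times, max_to_keep=5, keep_every_n_hours=None):
    """Return the save times (seconds since start) left on disk.

    Assumed rule: when a checkpoint falls out of the recent window, it is
    preserved if it is newer than the next "keep" deadline, otherwise deleted.
    """
    recent = []      # rolling window of the newest checkpoints
    preserved = []   # checkpoints kept permanently
    if keep_every_n_hours:
        next_keep = keep_every_n_hours * 3600
    else:
        next_keep = float("inf")  # never preserve anything extra

    for t in save_times:
        recent.append(t)
        if len(recent) > max_to_keep:
            oldest = recent.pop(0)
            if oldest > next_keep:
                preserved.append(oldest)
                next_keep += keep_every_n_hours * 3600
            # else: the oldest checkpoint is deleted
    return preserved + recent


# Four hours of training with a save every 600 seconds.
saves = range(600, 4 * 3600 + 1, 600)
rolling_only = simulate_saver(saves, max_to_keep=5)
with_hourly = simulate_saver(saves, max_to_keep=5, keep_every_n_hours=1)

print(rolling_only)
print(with_hourly)
```

In this model the hourly setting adds a few older checkpoints to the 5 newest ones, which is useful if you want to go back to an earlier state of training without storing every checkpoint.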
