
How can I specify the interval between two consecutive checkpoints in TensorFlow? There is no option in tf.train.Saver for this. Every time I run the model with a different number of global steps, I get a different interval between checkpoints.

Ahmed Ashour
Safaa

2 Answers


The tf.train.Saver is a "passive" utility for writing checkpoints, and it only writes a checkpoint when some other code calls its .save() method. Therefore, the rate at which checkpoints are written depends on what framework you are using to train your model:

  • If you are using the low-level TensorFlow API (tf.Session) and writing your own training loop, you can simply insert calls to Saver.save() in your own code. A common approach is to do this based on the iteration count:

    for i in range(NUM_ITERATIONS):
      sess.run(train_op)
      # ...
      if i % 1000 == 0:
        saver.save(sess, ..., global_step=i)  # Write a checkpoint every 1000 steps.
    
  • If you are using tf.train.MonitoredTrainingSession, which writes checkpoints for you, you can specify a checkpoint interval (in seconds) in the constructor. By default it saves a checkpoint every 10 minutes. To change this to every minute, you would do:

    with tf.train.MonitoredTrainingSession(..., save_checkpoint_secs=60) as sess:
      # ...
    
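The two triggering policies above (save every N steps, or save every T seconds) can be sketched in plain Python, independent of TensorFlow. The `PeriodicCheckpointer` class below is purely illustrative and not part of the TensorFlow API; in a real training loop, `should_save` would guard a call to `saver.save()`:

```python
import time

class PeriodicCheckpointer:
    """Illustrative helper (not a TensorFlow class): decides when to
    write a checkpoint, either every `every_steps` steps or every
    `every_secs` seconds of wall-clock time."""

    def __init__(self, every_steps=None, every_secs=None):
        self.every_steps = every_steps
        self.every_secs = every_secs
        self._last_save_time = time.monotonic()

    def should_save(self, step):
        # Step-based policy: fire whenever the step count hits a multiple.
        if self.every_steps is not None and step % self.every_steps == 0:
            return True
        # Time-based policy: fire once enough wall-clock time has elapsed.
        if self.every_secs is not None:
            now = time.monotonic()
            if now - self._last_save_time >= self.every_secs:
                self._last_save_time = now
                return True
        return False

# Step-based policy, mirroring the `i % 1000 == 0` check in the answer:
ckpt = PeriodicCheckpointer(every_steps=1000)
saved_at = [i for i in range(3001) if ckpt.should_save(i)]
# saved_at == [0, 1000, 2000, 3000]
```

Note that, like the loop in the answer, the step-based policy also fires at step 0; skip the first step explicitly if that is undesirable.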
mrry

Thanks! This fixed my problem:

    tf.contrib.slim.learning.train(
        train_op,
        checkpoint_dir,
        log_every_n_steps=args.log_every_n_steps,
        graph=g,
        global_step=model.global_step,
        number_of_steps=args.number_of_steps,
        init_fn=model.init_fn,
        save_summaries_secs=300,
        save_interval_secs=300,
        saver=saver)

Safaa