
I am pretty new to TensorFlow. I want to track the IO time and bandwidth (preferably the percentage of training time spent on checkpointing IO) for the internal checkpointing mechanism provided by the high-level tf.train.MonitoredTrainingSession, which is enabled by adding a tf.train.CheckpointSaverHook when initializing the tf.train.MonitoredTrainingSession.

I am thinking about using a tf.train.CheckpointSaverListener (i.e. its before_save and after_save methods) to log the time and track the IO. But will this logging technique give me a correct percentage calculation (i.e. time taken for checkpointing IO / time taken for training * 100%)? Here is a rough sketch of what I have in mind (the class and attribute names, IOTimingListener and total_save_seconds, are my own placeholders, not part of the API):
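
```python
import time
import tensorflow as tf

class IOTimingListener(tf.train.CheckpointSaverListener):
    """Accumulates wall-clock time spent inside checkpoint saves."""

    def __init__(self):
        self.total_save_seconds = 0.0
        self._save_start = None

    def before_save(self, session, global_step_value):
        # Record when the save begins.
        self._save_start = time.time()

    def after_save(self, session, global_step_value):
        # Add the duration of this save to the running total.
        elapsed = time.time() - self._save_start
        self.total_save_seconds += elapsed
        tf.logging.info('Checkpoint at step %d took %.3f s',
                        global_step_value, elapsed)
```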

I suspect that this checkpointing is done asynchronously on a thread separate from training. I have been looking through the TensorFlow source code to verify this, but I thought asking the question here might accelerate my exploration.

I am also open to suggestions for alternative techniques (e.g. TensorBoard, IO profiling tools, etc.).

Fahim

1 Answer


I believe it will.

The checkpointing isn't done asynchronously. You want the checkpoint to contain a consistent snapshot of the variables/parameters, so you do not want to checkpoint asynchronously with other operations that may update the parameter values.

The CheckpointSaverHook explicitly uses the Session to execute the operation that saves the checkpoint (source code) and waits for it to complete (it's basically invoking tf.train.Saver.save).

So the CheckpointSaverListener you thought of should work out fine, modulo the time taken by any other CheckpointSaverListeners in your program. As a rough sketch (assuming TF 1.x and a timing listener like the one you describe; train_op, the checkpoint directory, and save_steps below are placeholders), you could attach the listener through a CheckpointSaverHook and compute the percentage at the end:
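
```python
listener = IOTimingListener()  # the timing listener sketched in the question
saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir='/tmp/model',
    save_steps=1000,
    listeners=[listener])

start = time.time()
with tf.train.MonitoredTrainingSession(
        checkpoint_dir='/tmp/model',
        hooks=[saver_hook],
        # Disable the default saver hook so only our hook checkpoints.
        save_checkpoint_secs=None) as sess:
    while not sess.should_stop():
        sess.run(train_op)

total_seconds = time.time() - start
print('Checkpoint IO: %.1f%% of training time'
      % (100.0 * listener.total_save_seconds / total_seconds))
```

Note that the percentage computed this way includes everything the save does (graph serialization plus the actual file writes), which is usually what you want when measuring checkpointing overhead.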

Hope that helps.

ash