I am fairly new to TensorFlow. I want to track the IO time and bandwidth of checkpointing, preferably as the percentage of training time spent on checkpoint IO. The checkpointing I mean is the internal mechanism provided by the high-level `tf.train.MonitoredTrainingSession`, which is enabled by adding a `tf.train.CheckpointSaverHook` when initializing the session; a minimal sketch of this setup is below.
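For context, this is roughly the setup I am referring to (a minimal sketch; the checkpoint directory, save interval, and the trivial `train_op` are placeholders, not my real training code):

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)  # stand-in for a real training step

# An explicit CheckpointSaverHook, instead of the implicit one that
# MonitoredTrainingSession creates when checkpoint_dir is passed to it.
saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir='/tmp/ckpts',  # placeholder path
    save_steps=100)               # placeholder interval

with tf.train.MonitoredTrainingSession(
        hooks=[tf.train.StopAtStepHook(last_step=500), saver_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```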
I am thinking about using a `tf.train.CheckpointSaverListener` (i.e. its `before_save` and `after_save` methods) to log timestamps and track the IO; a sketch of what I have in mind follows below. But I have a question: will this logging technique give me a correct percentage, i.e. `time spent on checkpoint IO / total training time * 100%`?
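Here is what I have in mind (again a sketch with placeholder values; timing is plain wall-clock `time.time()`, and I assume `before_save`/`after_save` run around each save, including the saves that the hook performs at session creation and at the end):

```python
import time
import tensorflow as tf

class TimingListener(tf.train.CheckpointSaverListener):
    """Accumulates wall-clock time spent inside each checkpoint save."""

    def __init__(self):
        self._t0 = None
        self.total_save_time = 0.0

    def before_save(self, session, global_step_value):
        self._t0 = time.time()

    def after_save(self, session, global_step_value):
        elapsed = time.time() - self._t0
        self.total_save_time += elapsed
        tf.logging.info('Save at step %d took %.3fs (cumulative %.3fs)',
                        global_step_value, elapsed, self.total_save_time)

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)  # stand-in for a real training step

listener = TimingListener()
saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir='/tmp/ckpts',  # placeholder path
    save_steps=100,               # placeholder interval
    listeners=[listener])

train_start = time.time()
with tf.train.MonitoredTrainingSession(
        hooks=[tf.train.StopAtStepHook(last_step=500), saver_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)

wall_time = time.time() - train_start
print('Checkpoint IO: %.2f%% of wall-clock training time'
      % (listener.total_save_time / wall_time * 100.0))
```

This measures elapsed wall-clock time between `before_save` and `after_save`, which would only equal the checkpoint IO cost if the save actually blocks the training thread for that duration, which is exactly what I am unsure about.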
I suspect that this checkpointing is done asynchronously, on a thread separate from training, in which case the listener timings would not reflect time stolen from the training loop. I have been looking into the TensorFlow source to find out, but I thought asking here might accelerate my exploration.
I am also open to suggestions for alternative techniques (e.g. TensorBoard, IO profiling tools, etc.).