
I have a deep CNN/RNN that I train on Google AI Platform, distributing the training across 8 GPUs with tf.distribute.MirroredStrategy. After I recently upgraded the runtime version from 1.13 to 1.15, training became more than 2x slower. I read that tf.estimator.ProfilerHook can be used to identify performance bottlenecks, so I collected profiling information and rendered it in chrome://tracing. This is what I got:

profiling screenshot

A training step spends a full second in these _Send ops. What are they? I can't find any documentation on the op or any explanation of why it is in my graph. What does this mean?
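To put a number on what the screenshot shows, the trace file can be summarized per op. The following is only a minimal sketch: it assumes the file follows the Chrome trace event format, with a top-level "traceEvents" list whose complete events ("ph" equal to "X") carry the op type in "name" and a duration in microseconds in "dur". The helper names and the sample trace are hypothetical, made up for illustration, and are not real profiler output.

```python
import json
from collections import defaultdict


def total_duration_by_op(trace):
    """Sum 'dur' (microseconds) of complete events, grouped by event name."""
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # Only complete events ("ph" == "X") carry a duration;
        # metadata events ("ph" == "M") and others are skipped.
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return dict(totals)


def load_trace(path):
    """Load a trace JSON file written by the profiler (path is supplied by the caller)."""
    with open(path) as f:
        return json.load(f)


# Tiny hand-made trace, for illustration only.
sample_trace = {
    "traceEvents": [
        {"ph": "X", "name": "_Send", "ts": 0, "dur": 600000},
        {"ph": "X", "name": "_Send", "ts": 600000, "dur": 400000},
        {"ph": "X", "name": "MatMul", "ts": 0, "dur": 50000},
        {"ph": "M", "name": "process_name"},
    ]
}

totals = total_duration_by_op(sample_trace)

# Print ops from most to least total time, converted to seconds.
for name, micros in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {micros / 1e6:.3f} s")
```

For a real run, `total_duration_by_op(load_trace(path))` would be called on one of the timeline files the hook wrote. Note that summed durations can exceed wall-clock step time when ops overlap across devices, so the totals are a ranking aid rather than a precise measurement.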

Andy Carlson
