When training a DNN in a distributed setting, I would like to use Local SGD (also known as K-AVG SGD or Parallel SGD) to reduce communication overhead by decreasing the number of synchronization points.
However, I am unable to find an implementation of Local SGD in TensorFlow. Do you have any experience with this communication scheme?
Observation: This paper explains the benefits of using Local SGD.
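To clarify what I mean by Local SGD, here is a minimal single-process sketch of the update pattern: each worker takes several local SGD steps, and the workers' models are averaged only at periodic synchronization points. This is a toy NumPy simulation on a quadratic loss, not TensorFlow code; all names (`num_workers`, `local_steps`, etc.) are my own illustrative choices.

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * ||w - w_star||^2, minimized at w_star.
rng = np.random.default_rng(0)
w_star = np.array([3.0, -2.0])   # optimum of the toy loss
num_workers = 4                  # simulated parallel workers
local_steps = 8                  # local updates between synchronizations
rounds = 50                      # number of communication rounds
lr = 0.1

# Every worker starts from the same global model.
workers = [np.zeros(2) for _ in range(num_workers)]

for _ in range(rounds):
    for i in range(num_workers):
        w = workers[i]
        for _ in range(local_steps):
            # Stochastic gradient: exact gradient plus noise.
            grad = (w - w_star) + rng.normal(scale=0.1, size=2)
            w = w - lr * grad
        workers[i] = w
    # Synchronization point: average the local models. This is the only
    # communication step, done once every `local_steps` updates instead
    # of after every gradient step as in fully synchronous SGD.
    avg = np.mean(workers, axis=0)
    workers = [avg.copy() for _ in range(num_workers)]

print(workers[0])
```

In a real distributed setup the averaging step would be an all-reduce over the workers' variables; the point of the sketch is only that communication happens once per `local_steps` updates rather than once per step.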