
I'm new to TensorFlow. I have been learning how to use TensorFlow to train models in a distributed manner, and I have access to multiple servers, each with multiple CPUs.

The training mechanisms themselves are clearly outlined in the documentation and tutorials, but there is some ambiguity about how the data is managed when training with multiple workers. My understanding is that the data is stored on a single machine and that tf.distribute.DistributedDataset distributes it among the workers, roughly as in the sketch below.
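For concreteness, this is the kind of setup I have in mind (the file path and batch size are placeholders, not my actual configuration):

```python
import tensorflow as tf

# Each worker picks up its role (chief/worker, task index) from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Dataset stored on a single shared machine; path and batch size are
# placeholders for illustration only.
dataset = (
    tf.data.TFRecordDataset(["/shared/data/train-00000.tfrecord"])
    .shuffle(10_000)
    .batch(64)  # global batch size across all workers
)

# As I understand it, this is what splits the batches among the workers.
dist_dataset = strategy.experimental_distribute_dataset(dataset)
```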

Is my understanding correct that the shared data is stored on a single machine?

Consider a situation where we have multiple workers on our network and we want to train a model for 10 epochs on a large dataset. Does tf.distribute.DistributedDataset send the data to the workers 10 times, once per epoch? Is there any mechanism to prevent the same batches from being sent to the same worker ten times?
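The closest thing I have found so far is the auto-sharding policy on tf.data.Options, but I'm not sure whether it actually avoids re-sending the same data every epoch; the snippet below is only my guess at how it would be configured (the paths, batch size, and the choice of the FILE policy are assumptions on my part):

```python
import tensorflow as tf

options = tf.data.Options()
# FILE: each worker is assigned a disjoint subset of the input files.
# DATA: every worker reads the full dataset but keeps only its own shard
#       of the elements, so the data still travels over the network.
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.FILE
)

dataset = (
    tf.data.TFRecordDataset([
        "/shared/data/train-00000.tfrecord",  # placeholder paths
        "/shared/data/train-00001.tfrecord",
    ])
    .batch(64)
    .with_options(options)
    .repeat(10)  # the 10 epochs from the example above
)
```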

This post, for instance, states that:

Spark and HDFS are designed to work well together. When Spark needs some data from HDFS, it grabs the closest copy which minimizes the time data spends traveling around the network.

I'm looking for something similar for TensorFlow's distributed training.
