
Suppose we have a simple TensorFlow model with a few convolutional layers. We would like to train this model on a cluster of computers that is not equipped with GPUs. Each compute node of this cluster may have one or more cores. Is this possible out of the box? If not, which packages are able to do it? Are those packages able to perform data and model parallelism?


1 Answer


According to the TensorFlow documentation:

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs.

As mentioned above, it also supports CPU-only distributed training, with the requirement that all devices are on the same network.

Yes, you can use multiple machines to train the model; each machine needs a cluster and worker configuration like the one shown below.

import json
import os

# Cluster spec: two worker processes; this process is worker 0 (the "chief").
tf_config = {
    'cluster': {
        'worker': ['localhost:1234', 'localhost:6789']
    },
    'task': {'type': 'worker', 'index': 0}
}
# TensorFlow reads this configuration from the TF_CONFIG environment variable.
os.environ['TF_CONFIG'] = json.dumps(tf_config)

To learn more about the configuration and about training the model, please refer to Multi-worker training with Keras.
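
A minimal sketch of what each worker script might look like, assuming tf.distribute.MultiWorkerMirroredStrategy (which works on CPU-only machines over the network) and a placeholder convolutional model on MNIST; the architecture, dataset, ports, and batch size are illustrative, not taken from the question:

import tensorflow as tf

# Create the strategy after TF_CONFIG is set; it coordinates workers over the network.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables created inside the scope become distributed variables.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Placeholder dataset; every worker runs the same script with its own TF_CONFIG.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64)

model.fit(dataset, epochs=3)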

According to this SO answer:

tf.distribute.Strategy is integrated into tf.keras, so when model.fit is used with a tf.distribute.Strategy instance and the model is built inside strategy.scope(), the variables it creates are distributed variables. This allows the strategy to divide your input data equally across your devices.
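
If you need control over how the input is split across workers, the auto-sharding behavior can be adjusted through tf.data options; a brief sketch, with a toy dataset standing in for your real input pipeline:

import tensorflow as tf

# By default, multi-worker training auto-shards the input dataset across workers.
# The policy can be set explicitly, e.g. shard by data elements rather than by files.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = \
    tf.data.experimental.AutoShardPolicy.DATA

dataset = tf.data.Dataset.from_tensor_slices(list(range(1024))).batch(64)
dataset = dataset.with_options(options)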

Note: Distributed training mainly pays off for large datasets and complex models, where the performance gain outweighs the communication overhead.